Open nfraprado opened 1 month ago
Bumped into the same issue when checking preliminary results for the watchdog reset test: https://lava.collabora.dev/scheduler/job/14140804#L10766
This is another instance where missing characters can lead to a false regression (here /sys/class/watchdog0/timeout
became /sys/class/wathdog0/timeout
). We should consider adding test_character_delay
in the respective device type templates to see if it helps mitigating the issue. @nfraprado, based on your experience, does this only affect the spherion and tomato devices?
Maybe adding somewhere shell script:
if [ ! -e /sys/class/watchdog0/timeout ]; then
echo ERR-NODEV
echo ERR-NODEV
echo ERR-NODEV
fi
and triggering on shorter string "ERR-NODEV" (that also repeats, to increase chance it wont be mixed with some other stuff)?
@nuclearcat We need to fix the serial because we won't be able to change the code for the upstream tests, and we wouldn't want them to have to handle a flaky serial anyway.
@laura-nao So far I've only seen it happen on these two platforms, but I'll add updates here if I see it elsewhere.
I think adding test_character_delay
for these platforms will help, but every instance of this issue I've seen so far happened while the serial was still printing a lot of output, so I feel like adding an initial delay would also help a lot. The suggestion I got was to run udevadm settle
to wait for the activity on the device to settle down, but that command will be run on the device and so it could itself get mangled... Ideally we would define something in the LAVA template that makes LAVA wait for the serial to settle down before continuing, but I don't know if anything like that exists.
I think trogdor also have flaky serial
@nuclearcat right, I remember trogdor has flaky serial, but I think it was the output, not the input, and in that case the delay wouldn't help. But I'll check it and the other platforms.
For now I've created a PR to for us to test if this does fix it on spherion and tomato: https://github.com/kernelci/kernelci-pipeline/pull/626.
I noticed the input overrun
issue on Spherion only happened on baseline
, never on baseline-nfs
, meaning that instead of the character delay, an initial delay would probably also solve the issue. In any case, since the PR with the character delay was opened, there hasn't been a failure yet, so it seems to be working so far. But let's give it a bit more time.
The spherion and tomato platforms experience serial issues sometimes. Example on Spherion: https://lava.collabora.dev/scheduler/job/13962478
This issue happens at random, and by just re-running the job I got it to work: https://lava.collabora.dev/scheduler/job/13995464
Log snippet:
After that LAVA reports
lava-test-shell timed out after 60 seconds
.As can be seen from the log, the
lava-test-runner
command gets messed up, and the shell replies withnot found
. There's also attyS ttyS0: 1 input overrun(s)
message, which is worrying.Example on Tomato: https://lava.collabora.dev/scheduler/job/13962479
It looks slightly different. It seems that the shell prompt itself got split, so LAVA didn't recognize it
And the LAVA error message is
wait for prompt timed out
.Researching a bit I found this LAVA documentation page, which suggests setting
boot_character_delay
and/ortest_character_delay
as ways to avoid missing characters in the serial when the device can't keep up.Also, when the login prompt is reached, there are still many messages being printed by the kernel, which might be interfering, so another idea would be to add a delay so LAVA only sends the command once the serial output has settled down.