kernelci / kernelci-project

KernelCI Linux Foundation project documentation
9 stars 25 forks source link

Serial can't keep up on spherion and tomato (MediaTek Chromebooks) #366

Open nfraprado opened 1 month ago

nfraprado commented 1 month ago

The spherion and tomato platforms experience serial issues sometimes. Example on Spherion: https://lava.collabora.dev/scheduler/job/13962478

This issue happens at random, and by just re-running the job I got it to work: https://lava.collabora.dev/scheduler/job/13995464

Log snippet:

/ # /lava-13962478/bin/lava-test-runner /lava-13962478/0
[    9.705822] hub 1-1.4:1.0: USB hub found
/lava-13962478/bin/lava-tes[    9.706251] hub 1-1.4:1.0: 2 ports detected
[    9.782304] r8152-cfgselector 2-1.3: reset SuperSpeed USB device number 4 using xhci-mtk
[    9.811444] r8152 2-1.3:1.0: Direct firmware load for rtl_nic/rtl8153a-4.fw failed with error -2
[    9.811497] r8152 2-1.3:1.0: unable to load firmware patch rtl_nic/rtl8153a-4.fw (-2)
[    9.844050] r8152 2-1.3:1.0 eth0: v1.12.13
[    9.844408] r8152-cfgselector 2-1.3: USB disconnect, device number 4
[    9.919491] Console: switching to colour frame buffer device 240x67
t-runa-13962478/[    9.953121] ttyS ttyS0: 1 input overrun(s)
[   10.006751] mediatek-drm mediatek-drm.11.auto: [drm] fb0: mediatekdrmfb frame buffer device
0
/bin/sh: /lava-13962478/bin/lava-test-runa-13962478/0: not found
/ # [   10.034163] usb 2-1: reset SuperSpeed USB device number 2 using xhci-mtk

After that LAVA reports lava-test-shell timed out after 60 seconds.

As can be seen from the log, the lava-test-runner command gets messed up, and the shell replies with not found. There's also a ttyS ttyS0: 1 input overrun(s) message, which is worrying.

Example on Tomato: https://lava.collabora.dev/scheduler/job/13962479

It looks slightly different. It seems that the shell prompt itself got split, so LAVA didn't recognize it

. /lava-13962479[   10.704192] mtk-mdp3 14001000.dma-controller: can't get SCP node
/environment
/ [   10.711195] mtk-mdp3 14001000.dma-controller: Driver registered as /dev/video2
[   10.711728] mtk-mdp3 1400c000.dma-controller: Adding to iommu group 1
# [   10.726424] mtk-mdp3 14f08000.dma-controller: Adding to iommu group 1

And the LAVA error message is wait for prompt timed out.

Researching a bit I found this LAVA documentation page, which suggests setting boot_character_delay and/or test_character_delay as ways to avoid missing characters in the serial when the device can't keep up.

Also, when the login prompt is reached, there are still many messages being printed by the kernel, which might be interfering, so another idea would be to add a delay so LAVA only sends the command once the serial output has settled down.

laura-nao commented 4 weeks ago

Bumped into the same issue when checking preliminary results for the watchdog reset test: https://lava.collabora.dev/scheduler/job/14140804#L10766

This is another instance where missing characters can lead to a false regression (here /sys/class/watchdog0/timeout became /sys/class/wathdog0/timeout). We should consider adding test_character_delay in the respective device type templates to see if it helps mitigating the issue. @nfraprado, based on your experience, does this only affect the spherion and tomato devices?

nuclearcat commented 4 weeks ago

Maybe adding somewhere shell script:

if [ ! -e /sys/class/watchdog0/timeout ]; then
  echo ERR-NODEV
  echo ERR-NODEV
  echo ERR-NODEV
fi

and triggering on shorter string "ERR-NODEV" (that also repeats, to increase chance it wont be mixed with some other stuff)?

nfraprado commented 4 weeks ago

@nuclearcat We need to fix the serial because we won't be able to change the code for the upstream tests, and we wouldn't want them to have to handle a flaky serial anyway.

@laura-nao So far I've only seen it happen on these two platforms, but I'll add updates here if I see it elsewhere.

I think adding test_character_delay for these platforms will help, but every instance of this issue I've seen so far happened while the serial was still printing a lot of output, so I feel like adding an initial delay would also help a lot. The suggestion I got was to run udevadm settle to wait for the activity on the device to settle down, but that command will be run on the device and so it could itself get mangled... Ideally we would define something in the LAVA template that makes LAVA wait for the serial to settle down before continuing, but I don't know if anything like that exists.

nuclearcat commented 4 weeks ago

I think trogdor also have flaky serial

nfraprado commented 4 weeks ago

@nuclearcat right, I remember trogdor has flaky serial, but I think it was the output, not the input, and in that case the delay wouldn't help. But I'll check it and the other platforms.

For now I've created a PR to for us to test if this does fix it on spherion and tomato: https://github.com/kernelci/kernelci-pipeline/pull/626.

nfraprado commented 3 weeks ago

I noticed the input overrun issue on Spherion only happened on baseline, never on baseline-nfs, meaning that instead of the character delay, an initial delay would probably also solve the issue. In any case, since the PR with the character delay was opened, there hasn't been a failure yet, so it seems to be working so far. But let's give it a bit more time.