kernelci / kernelci-project

KernelCI Linux Foundation project documentation
10 stars 26 forks source link

Flaky serial on sc7180 (Qualcomm chromebooks) #380

Open nfraprado opened 3 months ago

nfraprado commented 3 months ago

The serial on sc7180 has been known to be flaky for a while. Now that we're getting results from those platforms on the new system that we can do some data analysis on, I'm opening this issue so we can track progress on finally tracking it down and fixing the issue.

Recent example: https://lava.collabora.dev/scheduler/job/14158411

I've recently opened an issue for the serial on some Mediatek platforms: https://github.com/kernelci/kernelci-project/issues/366 . But if seems different in this case, as the issue happened on the output, not the input. I also feel like I've seen this exact STARTRUN error on sc7180 specifically too many times. If that's the only error that happens here it could be an issue specific to the first message that is printed. Need to collect more failed runs.

nfraprado commented 3 months ago

All of the test failures on both limozeen and kingoftown for this week have the exact same LAVA error message returned: Invalid TESTCASE signal, and they all happened on the baseline test, never on baseline-nfs. And there were about 40 instances of them. One big difference between baseline and baseline-nfs is that on baseline the prompt shows at around 5 seconds from boot, and there are still messages going on, while on baseline-nfs the prompt shows up much later at around 24 seconds from boot, and there aren't any messages still being printed, so the system is mostly idle by then.

I added a failure_retry to the test to see if that LAVA STARTRUN message could just be retried if it failed. But some of the commands throwed errors, so it seems they're not prepared to be run multiple times: https://lava.collabora.dev/scheduler/job/14242039

Some ideas that come to mind is trying the test_character_delay (even though I don't think it would solve things in this case, because it's the output not the input), adding an initial delay with udevadm settle or a plain sleep, and looking into the code that prints that STARTRUN message to see if there's anything special about it.

nfraprado commented 3 months ago

test_character_delay didn't help, unsurprisingly: https://lava.collabora.dev/scheduler/job/14278559

Here's a good run for comparison: https://lava.collabora.dev/scheduler/job/14278523

In the failed run, there's:

+ cd /lava-14278559/0/tests/0_dmesg[

The trailing [ is most likely all that is left of the STARTRUN command message that was send to the kernel buffer. Since the [ is there I believe this means that the message made it to the kernel buffer, but the serial output couldn't keep up for whatever reason. It would be good to double check that the message is there in the kernel ring buffer, but I can't think of a way to do that. As soon as the error happens, the test exits, even if there's another test defined later: https://lava.collabora.dev/scheduler/job/14278613

As for adding an initial delay, udevadm settle or sleep won't help actually, for an obvious reason, they would be executed after the STARTRUN message is sent. One alternative would be to use LAVA's command action: https://docs.lavasoftware.org/lava/actions-command.html and sleep on the dispatcher side. But it seems that requires some setup on the LAVA admin side, so not something that I could quickly use for debugging purposes: https://lava.collabora.dev/scheduler/job/14278592

I also checked the code that sends the STARTRUN message: https://github.com/kernelci/test-definitions/blob/4fc8e57d0d72e4d19e4754931edebdc5594c6c16/automated/utils/test-runner.py#L330 . But I didn't find anything special about it, besides being the first command that writes to the output (which goes to /dev/kmsg). The ENDRUN message slightly below looks exactly the same and I've never seen that one missing.

nfraprado commented 3 months ago

I tried using LAVA's login_commands to run udevadm settle before the tests start, but still got failing jobs:

Pass rates and duration stats in seconds:

job                          loops  rate      min    max    avg      std
-------------------------  -------  ------  -----  -----  -----  -------
sc7180-serial-login-delay       10  50%        79     83   80.4  1.51658

sc7180-serial-login-delay failed jobs:
   https://lava.collabora.dev/scheduler/job/14288164
   https://lava.collabora.dev/scheduler/job/14288165
   https://lava.collabora.dev/scheduler/job/14288166
   https://lava.collabora.dev/scheduler/job/14288256
   https://lava.collabora.dev/scheduler/job/14288258

sc7180-serial-login-delay passed jobs:
   https://lava.collabora.dev/scheduler/job/14288163
   https://lava.collabora.dev/scheduler/job/14288167
   https://lava.collabora.dev/scheduler/job/14288195
   https://lava.collabora.dev/scheduler/job/14288252
   https://lava.collabora.dev/scheduler/job/14288257

Then I tried sleeping for 5 seconds in login_commands, and to my surprise, that still fails: https://lava.collabora.dev/scheduler/job/14288292

So it seems this issue doesn't have anything to do with the serial output being busy from boot, and no kind of delay will help.

Still not sure how this happens, given that kernel messages should be printed atomically to the serial, and why the STARTRUN message is the most affected. It would be good if a serial stress test that could guarantee reproducibility of the issue could be found.

nfraprado commented 3 months ago

I was pointed out by Doug Anderson to a series he recently sent to make the serial more reliable on qualcomm platforms, including sc7180: https://lore.kernel.org/all/20240610222515.3023730-1-dianders@chromium.org/#t

The results speak for themselves:

Pass rates and duration stats in seconds:

job                             loops  rate      min    max    avg      std
----------------------------  -------  ------  -----  -----  -----  -------
sc7180-before-serial-patches       20  10%       120    120  120    0
sc7180-after-serial-patches        20  100%       99    122  117.1  7.53169

sc7180-before-serial-patches failed jobs:
   https://lava.collabora.dev/scheduler/job/14478259
   https://lava.collabora.dev/scheduler/job/14478337
   https://lava.collabora.dev/scheduler/job/14478340
   https://lava.collabora.dev/scheduler/job/14478351
   https://lava.collabora.dev/scheduler/job/14478354
   https://lava.collabora.dev/scheduler/job/14478357
   https://lava.collabora.dev/scheduler/job/14478358
   https://lava.collabora.dev/scheduler/job/14478362
   https://lava.collabora.dev/scheduler/job/14478414
   https://lava.collabora.dev/scheduler/job/14478418
   https://lava.collabora.dev/scheduler/job/14478420
   https://lava.collabora.dev/scheduler/job/14478423
   https://lava.collabora.dev/scheduler/job/14478438
   https://lava.collabora.dev/scheduler/job/14478441
   https://lava.collabora.dev/scheduler/job/14478444
   https://lava.collabora.dev/scheduler/job/14478459
   https://lava.collabora.dev/scheduler/job/14478468
   https://lava.collabora.dev/scheduler/job/14478470

sc7180-before-serial-patches passed jobs:
   https://lava.collabora.dev/scheduler/job/14478348
   https://lava.collabora.dev/scheduler/job/14478466

sc7180-after-serial-patches passed jobs:
   https://lava.collabora.dev/scheduler/job/14478260
   https://lava.collabora.dev/scheduler/job/14478338
   https://lava.collabora.dev/scheduler/job/14478341
   https://lava.collabora.dev/scheduler/job/14478347
   https://lava.collabora.dev/scheduler/job/14478350
   https://lava.collabora.dev/scheduler/job/14478352
   https://lava.collabora.dev/scheduler/job/14478356
   https://lava.collabora.dev/scheduler/job/14478359
   https://lava.collabora.dev/scheduler/job/14478361
   https://lava.collabora.dev/scheduler/job/14478370
   https://lava.collabora.dev/scheduler/job/14478380
   https://lava.collabora.dev/scheduler/job/14478412
   https://lava.collabora.dev/scheduler/job/14478416
   https://lava.collabora.dev/scheduler/job/14478419
   https://lava.collabora.dev/scheduler/job/14478422
   https://lava.collabora.dev/scheduler/job/14478434
   https://lava.collabora.dev/scheduler/job/14478439
   https://lava.collabora.dev/scheduler/job/14478442
   https://lava.collabora.dev/scheduler/job/14478445
   https://lava.collabora.dev/scheduler/job/14478463

So before the patch out of 20 runs, 18 failed due to the missing message in the serial. After the patch all 20 runs passed. It's a clear fix. I'll report the results upstream and we just have to wait for it to be merged for this issue to finally go away :).

nfraprado commented 3 weeks ago

There's a new series on the list to fix the serial: https://lore.kernel.org/all/20240902152451.862-1-johan+linaro@kernel.org/

I've tested and it works. Hopefully this one gets merged soon.

Results for reference:

Pass rates and duration stats in seconds:

job                                     loops  rate      min    max      avg       std
------------------------------------  -------  ------  -----  -----  -------  --------
before-fix-console-corruption-series       10  30%       126    140  134.333   7.37111
after-fix-console-corruption-series        10  100%      120    162  135.7    17.3529

before-fix-console-corruption-series failed jobs:
   https://lava.collabora.dev/scheduler/job/15486156
   https://lava.collabora.dev/scheduler/job/15486251
   https://lava.collabora.dev/scheduler/job/15486493
   https://lava.collabora.dev/scheduler/job/15486796
   https://lava.collabora.dev/scheduler/job/15486813
   https://lava.collabora.dev/scheduler/job/15487024
   https://lava.collabora.dev/scheduler/job/15487240

before-fix-console-corruption-series passed jobs:
   https://lava.collabora.dev/scheduler/job/15486124
   https://lava.collabora.dev/scheduler/job/15486565
   https://lava.collabora.dev/scheduler/job/15486756

after-fix-console-corruption-series passed jobs:
   https://lava.collabora.dev/scheduler/job/15486125
   https://lava.collabora.dev/scheduler/job/15486157
   https://lava.collabora.dev/scheduler/job/15486268
   https://lava.collabora.dev/scheduler/job/15486495
   https://lava.collabora.dev/scheduler/job/15486579
   https://lava.collabora.dev/scheduler/job/15486757
   https://lava.collabora.dev/scheduler/job/15486789
   https://lava.collabora.dev/scheduler/job/15486798
   https://lava.collabora.dev/scheduler/job/15487023
   https://lava.collabora.dev/scheduler/job/15487225
nuclearcat commented 3 weeks ago

Amazing news!!!

nfraprado commented 2 weeks ago

The series has been included into tty-next, so should land in mainline very soon.