Shutdown ChromeBooks properly

a-wai commented 6 months ago

Currently, we issue the poweroff command at the end of a test job. This results in the following output:

System has not been booted with systemd as init system (PID 1). Can't operate.

Moreover, this is only executed when tests run successfully, meaning there are many occurrences where we simply cut the power to the device. This could lead to filesystem corruption and consequently trigger issues like #2491.

We should instead ensure we issue the appropriate command, and always do so.
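One way to "always do so" from a shell script is an EXIT trap, which fires whether the tests pass or fail. This is only a minimal sketch, not what the job templates actually do: `shutdown_dut` and `run_with_shutdown` are hypothetical names, and the helper just prints a message instead of contacting a device. It would not help when the job times out and power is cut from outside.

```shell
#!/bin/sh
# Hypothetical helper: in a real job this would send poweroff to the DUT.
shutdown_dut() { echo "shutting down DUT"; }

# Run the given test command in a subshell whose EXIT trap always fires,
# whether the command succeeds or fails. The subshell keeps the trap
# scoped to this one call and preserves the command's exit status.
run_with_shutdown() {
    (
        trap shutdown_dut EXIT
        "$@"
    )
}

run_with_shutdown true
run_with_shutdown false || echo "tests failed, DUT was still shut down"
```

In both calls the shutdown helper runs, and the failing call still reports a non-zero status to the caller.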

nuclearcat commented 6 months ago

I wonder if we can trace the causes of the filesystem corruption. It might be a big deal (a bug) if a ChromeBook crashes and the user loses the OS or data on it.

10ne1 commented 5 months ago

poweroff should be implemented by upstart, so the systemd init system output is very strange and unexpected.

I happened to be running an upstream kernel on mt8183 and do not see it when running poweroff. I attached the full log output from the serial console: jacuzzi_poweroff.log.txt

a-wai commented 5 months ago

See #2497 & #2498 for details, but basically the issue with the systemd message was that the command was pretty much ssh sync && poweroff: sync was executed on the ChromeBook through ssh, but poweroff was executed on the LAVA runner, inside the test container (which obviously can't work).
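To illustrate the difference, here is a minimal sketch (not the actual template code from those PRs). `ssh` and `poweroff` are replaced by stand-in shell functions so nothing is really contacted or powered off, and the DUT address is hypothetical.

```shell
#!/bin/sh
# Stand-ins so the sketch runs without a real DUT: "ssh" only reports what
# would be executed remotely, "poweroff" reports that it ran locally.
ssh() { shift; echo "remote: $*"; }
poweroff() { echo "local: poweroff"; }

DUT="root@dut.example"  # hypothetical address

# Broken form: the local shell parses "&&", so only sync goes over ssh
# and poweroff runs on the machine that launched ssh.
broken() { ssh "$DUT" sync && poweroff; }

# Fixed form: the whole command line is quoted and sent to the DUT,
# where the remote shell runs both sync and poweroff.
fixed() { ssh "$DUT" 'sync && poweroff'; }

broken
fixed
```

With the unquoted form, only `sync` reaches the device; quoting the command keeps `&&` and `poweroff` on the remote side.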

The LAVA jobs linked in the comments for those PRs show devices properly shutting down, so I think this issue can be closed once those are merged :)

a-wai commented 5 months ago

A good example of a corrupt device likely due to a hard reboot is dell-latitude-5400-4305U-sarien-cbg-0:

job 13357222 runs perfectly fine
DUT likely crashed during job 13357236 and was forced off when the job timed out
job 13357275 exhibits signs of a corrupt filesystem

This case will be more complex to handle though, as we can't really prevent devices from crashing while running the tests...

nuclearcat commented 5 months ago

The Sarien issue has been plaguing me for a long time. Maybe we need to report it to Google, if this looks like a bug. The fact that it doesn't happen on other devices increases the probability that it is a bug or a hardware-specific issue. It is abnormal for the system to get corrupted on a crash.

10ne1 commented 5 months ago

Thank you @a-wai for clarifying that the poweroff command was running on the LAVA runner instead of the board. That makes sense and explains the systemd message.

I left a comment on the commit. It's not a blocker: I assume upstart will do the right thing and sync before poweroff, so feel free to ignore it.

10ne1 commented 5 months ago

The Sarien issue has been plaguing me for a long time. Maybe we need to report it to Google, if this looks like a bug. The fact that it doesn't happen on other devices increases the probability that it is a bug or a hardware-specific issue. It is abnormal for the system to get corrupted on a crash.

Yes, we should report it; however, first we need to reproduce and investigate the issues.

For example, I'm not entirely certain the job pointed out by Arnaud has a crash according to the CrOS definition of a crash (in that case the next step would be to get and analyze a core dump from the crash handler daemon, which does not seem to be involved in the logs). In this specific case, maybe the kernel experienced a hang / hard lock-up...

Either way, yes, needs further investigation.

a-wai commented 5 months ago

For example, I'm not entirely certain the job pointed out by Arnaud has a crash according to the CrOS definition of a crash [...]. In this specific case, maybe the kernel experienced a hang / hard lock-up...

It seems so (kernel hang) based on the kernel logs; my choice of words was indeed sub-optimal.

a-wai commented 5 months ago

The Sarien issue has been plaguing me for a long time

"Fun" fact regarding the devices I re-flashed last week: they include all (yes, all!) devices of the following types:

sarien (including arcada, all Dell devices, Intel chip)
grunt (Acer and multiple HP models, AMD chip)
zork (Lenovo and multiple HP models, AMD chip)
hatch (Asus, Intel chip)

There's very likely a problem with those device types making them more prone to FS corruption.

For other device types, a few MTK devices were affected (3x corsola, 2x asurada, 2x cherry), and only one each of skyrim and trogdor. The latter can be considered rare incidents, and MTK devices should be monitored until we have more stats.

a-wai commented 5 months ago

Interestingly, it seems no asus-CM1400CXA-dalboz (which is actually a zork device type) is affected; however, those weren't used by the new system until today (see https://github.com/kernelci/kernelci-pipeline/pull/541), so let's see how it goes now...

a-wai commented 4 months ago

Closing as the templates were reworked to send poweroff to the DUT after the tests ran.

kernelci / kernelci-core

Shutdown ChromeBooks properly #2492