oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0

Diagnostic features for update failures #1867

Open cbiffle opened 1 month ago

cbiffle commented 1 month ago

We had a sidecar SP fail update at a customer site today in a rather ambiguous manner. This issue is intended to collect ideas for diagnostic tools we could have built that would have helped today, so that we can hopefully build them before this reproduces much more.

One possibility is that this is simply an MGS timeout that has drifted out of sync with how long Sidecar takes to boot in practice. We know Sidecar boot is nondeterministic (https://github.com/oxidecomputer/hardware-sidecar/issues/741) so if the timeout is marginal, it could happen rarely for certain units.

Potential root causes I've floated, and tools that might help distinguish them, include:

Please add more ideas.

labbott commented 1 month ago

We need a way for the RoT to tell us that it triggered a bank swap, other than the ringbuf.

cbiffle commented 1 month ago

Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. https://github.com/oxidecomputer/management-gateway-service/issues/284
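To make the margin concrete, here is a rough sketch of the arithmetic in Rust. The function names are hypothetical and are not taken from MGS or Hubris. It also assumes the 26-second boot was not already at the slow end of the distribution, and that the full 5 seconds of variability could land on top of it; neither is established by the observations above.

```rust
use std::time::Duration;

/// Worst-case boot time if all of the observed variability lands on the
/// slow side of an observed boot. (Hypothetical helper, for illustration.)
fn worst_case_boot(observed: Duration, variability: Duration) -> Duration {
    observed + variability
}

/// Returns how far the worst case exceeds the timeout, or `None` if the
/// worst case still fits within (or exactly meets) the timeout.
fn timeout_overshoot(
    timeout: Duration,
    observed: Duration,
    variability: Duration,
) -> Option<Duration> {
    worst_case_boot(observed, variability)
        .checked_sub(timeout)
        .filter(|over| !over.is_zero())
}
```

With the numbers from this comment (30 s timeout, 26 s observed, 5 s variability), `timeout_overshoot` returns `Some` of 1 second: under these assumptions the worst case is 31 s, which is past the timeout. Since the variability is "5+" seconds, the real overshoot could be larger.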

labbott commented 1 month ago

Read out/measure the auxflash (thanks @lzrd for the idea)

jgallagher commented 1 month ago

Ask the SP what its current time is (this is exclusively upstack work; the MgsRequest::CurrentTime message already exists and is supported on the SP: https://github.com/oxidecomputer/management-gateway-service/issues/283)

cbiffle commented 1 month ago

In case anyone's investigating cores from this, we also get a crash of both the sequencer and power tasks. These crashes are both deliberate in the code and don't appear to be related, except possibly in causing some of the boot time nondeterminism.

https://github.com/oxidecomputer/hubris/blob/master/task/power/src/bsp/sidecar_bcd.rs#L42C1-L42C33

https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-seq-server/src/main.rs#L919

cbiffle commented 1 month ago

@rmustacc pointed out that at least part of the variability will be coming from our accidental hardware random number generator: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/116

One such debugging session, with logs and stuff, here: https://github.com/oxidecomputer/hardware-sidecar/issues/830