Diagnostic features for update failures

cbiffle commented 2 months ago

We had a Sidecar SP fail an update at a customer site today in a rather ambiguous manner. This issue collects ideas for diagnostic tools that would have helped today, so that we can hopefully build them before this reproduces much more.

One possibility is that this is simply an MGS timeout that has drifted out of sync with how long Sidecar takes to boot in practice. We know Sidecar boot is nondeterministic (https://github.com/oxidecomputer/hardware-sidecar/issues/741), so if the timeout is marginal, this failure could happen rarely for certain units.

Potential root causes I've floated, and tools that might help distinguish them, include:

1. Update was written incorrectly due to corruption, off-by-one, or other bug.
   - The ability to read out, or at least hash, the idle bank would let us check for this.
2. SP rebooted into new image which was written correctly, but something failed during initialization that prevented the network from coming up.
   - It would be really great to be able to read a dump from a previous boot out of the dump area, to see if anything panicked last boot.
3. Sidecar may have taken too long to start up for the timeout in MGS, and this might all be an illusion.
   - MGS may want to revise up that timeout (I would also argue for making it configurable, for the next time this happens).
   - We should take a pass over Sidecar startup and check for any optimizations we could make there.
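
As a rough illustration of the first check, here is a minimal sketch that assumes the idle bank can be read back as raw bytes (that readout is the tool being asked for, so `readback` is only a stand-in). Comparing SHA-256 digests answers "was it written correctly?", and the offset of the first differing byte helps tell corruption from an off-by-one. The function names are hypothetical and not part of any existing tool.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 of an image or a bank readback, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def first_mismatch(expected: bytes, readback: bytes):
    """Return the offset of the first byte where `readback` differs from
    `expected`, or None if the readback begins with the expected image.

    The readback may be longer than the image, so only the first
    len(expected) bytes are compared.
    """
    for offset, (want, got) in enumerate(zip(expected, readback)):
        if want != got:
            return offset
    if len(readback) < len(expected):
        # Every byte present matched, but the readback ends early.
        return len(readback)
    return None


def written_correctly(expected: bytes, readback: bytes) -> bool:
    """Hash-only check, for when only a digest can be retrieved."""
    return digest(expected) == digest(readback[: len(expected)])
```

A mismatch at offset 0 with everything shifted by one byte points at an off-by-one, while a mismatch deep in the image looks more like corruption.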

Please add more ideas.

labbott commented 2 months ago

We need a way, besides the ringbuf, for the RoT to tell us that it triggered a bank swap.

cbiffle commented 2 months ago

Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. https://github.com/oxidecomputer/management-gateway-service/issues/284
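
Putting those numbers together shows why the timeout looks marginal. This is a small sketch of the arithmetic. It treats 5 seconds as a lower bound on the variability and assumes, which is not established here, that the variability can add to the 26-second boot we observed.

```python
def timeout_margin_s(timeout_s, observed_boot_s, variability_s):
    """Seconds of headroom left if a boot runs `variability_s` longer than
    the one we observed. A negative result means the timeout fires first."""
    return timeout_s - (observed_boot_s + variability_s)


# Numbers from this thread: 30 s timeout, 26 s observed boot, 5+ s variability.
print(timeout_margin_s(30, 26, 5))  # -1
```

Under that assumption, a boot only 5 seconds slower than the successful one would already miss the 30-second timeout by a second.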

labbott commented 2 months ago

Read out/measure the auxflash (thanks @lzrd for the idea)

jgallagher commented 2 months ago

Ask the SP what its current time is (this is exclusively upstack work; the MgsRequest::CurrentTime message already exists and is supported on the SP: https://github.com/oxidecomputer/management-gateway-service/issues/283)
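
One way upstack software might use that reading, sketched under a loud assumption: that the value the SP reports is the time since its last boot, in milliseconds. The actual semantics of the message should be checked against the linked issue. If the assumption holds, an uptime shorter than the time since we asked for the reset means the SP really did restart, and the difference estimates how long it took to start counting. All names below are hypothetical.

```python
def rebooted_since_reset(sp_uptime_ms: int, ms_since_reset_request: int) -> bool:
    """True if the SP's clock started after we asked for the reset.

    ASSUMPTION: sp_uptime_ms is time since the SP's last boot.
    """
    return sp_uptime_ms < ms_since_reset_request


def boot_delay_ms(sp_uptime_ms: int, ms_since_reset_request: int):
    """Estimated time between the reset request and the SP's clock starting,
    or None if the SP does not appear to have rebooted."""
    if not rebooted_since_reset(sp_uptime_ms, ms_since_reset_request):
        return None
    return ms_since_reset_request - sp_uptime_ms
```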

cbiffle commented 2 months ago

In case anyone's investigating cores from this, we also get a crash of both the sequencer and power tasks. These crashes are both deliberate in the code and don't appear to be related, except possibly in causing some of the boot time nondeterminism.

https://github.com/oxidecomputer/hubris/blob/master/task/power/src/bsp/sidecar_bcd.rs#L42C1-L42C33

https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-seq-server/src/main.rs#L919

cbiffle commented 2 months ago

> Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. oxidecomputer/management-gateway-service#284

@rmustacc pointed out that at least part of the variability will be coming from our accidental hardware random number generator: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/116

One such debugging session, with logs and stuff, here: https://github.com/oxidecomputer/hardware-sidecar/issues/830

oxidecomputer / hubris

Diagnostic features for update failures #1867