Open cbiffle opened 1 month ago
We need a way for the RoT to tell us that it triggered a bank swap, other than the ringbuf.
Note that the successful SP reboot we observed was 26 seconds; the timeout is 30 seconds, and we've seen 5+ seconds of variability. https://github.com/oxidecomputer/management-gateway-service/issues/284
Read out/measure the auxflash (thanks @lzrd for the idea)
Ask the SP what its current time is (this is exclusively upstack work; the MgsRequest::CurrentTime message already exists and is supported on the SP: https://github.com/oxidecomputer/management-gateway-service/issues/283)
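To make the "ask the SP for its current time" idea concrete, here is a rough sketch of how upstack software might interpret the answer. It assumes the reply can be read as milliseconds since the SP booted, which should be confirmed against the real message definition. The names `RebootEvidence`, `classify`, and `reset_to_boot_delay_ms` are hypothetical and are not part of MGS or Hubris.

```rust
/// Hypothetical classification of what the SP's reported time tells us
/// after we asked it to reset.
#[derive(Debug, PartialEq)]
enum RebootEvidence {
    /// SP uptime is shorter than the time since the reset request,
    /// so the SP restarted at some point after we asked.
    Rebooted,
    /// SP uptime is at least as long as the time since the reset request,
    /// so the SP has been running continuously since before we asked.
    NotRebooted,
}

/// Compare the SP's reported uptime (assumed: milliseconds since boot)
/// with the caller's own monotonic measurement of time since the reset request.
fn classify(sp_uptime_ms: u64, ms_since_reset_request: u64) -> RebootEvidence {
    if sp_uptime_ms < ms_since_reset_request {
        RebootEvidence::Rebooted
    } else {
        RebootEvidence::NotRebooted
    }
}

/// Estimate how long after the reset request the SP's timer started counting.
/// Returns None when the uptime exceeds the elapsed time (no reboot observed).
fn reset_to_boot_delay_ms(sp_uptime_ms: u64, ms_since_reset_request: u64) -> Option<u64> {
    ms_since_reset_request.checked_sub(sp_uptime_ms)
}
```

For example, `classify(3_000, 40_000)` would report `Rebooted`, which would suggest that an update "failure" was really a timeout on the caller's side rather than an SP that never came back.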
In case anyone's investigating cores from this, we also get a crash of both the sequencer and power tasks. These crashes are both deliberate in the code and don't appear to be related, except possibly in causing some of the boot time nondeterminism.
https://github.com/oxidecomputer/hubris/blob/master/task/power/src/bsp/sidecar_bcd.rs#L42C1-L42C33
https://github.com/oxidecomputer/hubris/blob/master/drv/sidecar-seq-server/src/main.rs#L919
@rmustacc pointed out that at least part of the variability will be coming from our accidental hardware random number generator: https://github.com/oxidecomputer/hardware-qsfp-x32/issues/116
One such debugging session, with logs and stuff, here: https://github.com/oxidecomputer/hardware-sidecar/issues/830
We had a sidecar SP fail update at a customer site today in a rather ambiguous manner. This issue is intended to collect ideas for diagnostic tools we could have built that would have helped today, so that we can hopefully build them before this reproduces much more.
One possibility is that this is simply an MGS timeout that has drifted out of sync with how long Sidecar takes to boot in practice. We know Sidecar boot is nondeterministic (https://github.com/oxidecomputer/hardware-sidecar/issues/741), so if the timeout is marginal, the failure could show up only rarely, and only on certain units.
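As a back-of-the-envelope check on the "marginal timeout" theory, the sketch below uses the figures reported in this thread: a 26 second successful reboot, a 30 second timeout, and 5+ seconds of observed variability. The helper names are hypothetical, and this is just the arithmetic, not anything MGS actually implements.

```rust
use std::time::Duration;

/// Headroom between an observed boot time and the timeout
/// (zero if the boot already took longer than the timeout).
fn headroom(observed: Duration, timeout: Duration) -> Duration {
    timeout.saturating_sub(observed)
}

/// True if a boot that took `observed` could exceed `timeout` once
/// `jitter` of run-to-run variability is added on top.
fn timeout_is_marginal(observed: Duration, jitter: Duration, timeout: Duration) -> bool {
    observed + jitter > timeout
}
```

With the numbers above, `headroom` is 4 seconds while the variability is 5+ seconds, so `timeout_is_marginal(26 s, 5 s, 30 s)` is true: an otherwise healthy boot could occasionally miss the deadline.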
Potential root causes I've floated, and tools that might help distinguish them, include:
Please add more ideas.