oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.

gimlet went back to A0 while wicket was updating its phase1 flash #1613

Closed: jgallagher closed this issue 7 months ago

jgallagher commented 7 months ago

When @augustuswm was mupdating rack3 today, he saw a failure on BRM42220022.

The underlying error the SP reported is:

update b694aaf5-a1f7-4f4e-b791-c096844e7b59 failed (code=6)

code=6 translates to HfError::NotMuxedToSp. We then checked, and this sled was back in A0 and stuck in a reboot loop due to mismatched phase1/phase2 flash contents.

We believe the sequence of events was:

  1. Wicket sent the sled to A2. (wicket believes this was successful)
  2. We finished "preparing for update", which erases the flash. (wicket believes this was successful)
  3. While the new phase1 data was being streamed in, the sled went back to A0, which switched the mux back to the host and caused the update writes to fail.

Assuming this sequence is correct, we don't know what caused the sled to go back to A0. There's an SP dump in /staff/core/rack3/colo-64/BRM42220022. I don't know the sequencer side of the SP all that well, but I don't see any smoking guns (e.g., there are no tasks with nonzero generations). I'm not sure we'll be able to get anything useful out of the ringbufs, because the host being in a reboot loop means the sled has intentionally gone through some number of A2 trips since the failure. This is very visible in the host_sp_comms ringbuf, which ends with:

   1  690    13754        1 Request { now: 0x4ace2e6, sequence: 0x6, message: HostBootFailure { reason: 0x2 } }
   2  255    13754        1 Response { now: 0x4ace2e6, sequence: 0x8000000000000006, message: Ack }
   3  690    13754        1 Request { now: 0x4ace2f6, sequence: 0x7, message: HostPanic { code: 0x1101 } }
   4  255    13754        1 Response { now: 0x4ace2f6, sequence: 0x8000000000000007, message: Ack }
   5  690    13754        1 Request { now: 0x4acec94, sequence: 0x8, message: RequestReboot }
   6  356    13754        1 SetState { now: 0x4aced6a, state: A2 }
   7  685    13754        1 ParseError(Cobs)
   8  410    13754        1 JefeNotification { now: 0x4aced6a, state: A2 }
   9  311    13754        1 HfMux { now: 0x4aced6a, state: Some(SP) }
  10 1169    13754        1 SetState { now: 0x4ad00f2, state: A0 }
  11  410    13754        1 JefeNotification { now: 0x4ad043e, state: A0 }
  12  311    13754        1 HfMux { now: 0x4ad043e, state: Some(HostCPU) }

which is exactly what you'd expect when the host requests a reboot.

It's possible something external sent a message through MGS or faux-mgs instructing the sled to go back to A0, in which case the SP is completely innocent here. But (a) we don't know what that would've been, and (b) we've never seen such a thing before, so I figured it was probably safest to start with an issue here.

cbiffle commented 7 months ago

Dug into this a bit, and I think the tl;dr is that we're probably missing the information we'd need to root-cause this.

Here's what I know:

At the time the dump was recorded, the SP had been up for 21.80 hours. So, there was no unexpected power cycle or SP reboot during the wicket update.

At the time the dump was recorded, the sequencer was performing continuous reboots (about once every 84 seconds) under command of host-sp-comms, which I presume was a response to the host noticing that everything was hosed. That part seems to be working as intended.

Unfortunately, gimlet-seq's ringbuf is chatty enough in response to commands like that that any prior history of interaction was lost. So, gotta look at the code. Fortunately, this appears to have been a production firmware image, meaning it's missing udprpc, so any IPCs in the system will have come from other tasks and not some joker on the network.

gimlet-seq itself will only transition autonomously to A0 in one situation: when it has finished board startup in A2, if it is not built with the stay-in-a2 feature (which is applied to lab machines only). Because gimlet-seq's generation in the dump is zero, it only went through init once, presumably before wicket started updating the flash.

All other paths to A0 in the firmware lead through the gimlet-seq IPC interface's set_state operation. There are three users of that in a Gimlet image, of which only two are capable of setting A0: control-plane-agent and host-sp-comms. (thermal also uses it, but only to emergency-power-off the machine.)

control-plane-agent will set the power state in response to its (quite appropriately named) set_power_state operation. host-sp-comms will do it in response to requests from the host. As I mentioned above, the latter is definitely happening, because we can see it in the host-sp-comms ringbuf:

  10 1169    13754        1 SetState { now: 0x4ad00f2, state: A0 }

However, this is probably a red herring, because host-sp-comms only does this in response to the timer that manages a host-initiated reboot. We know the host is requesting reboots -- that's why this bug exists. So unless the host somehow requested an initial reboot in A2 -- y'know, when it is unpowered -- host-sp-comms is probably off the hook for the initial problem.

This leaves us with control-plane-agent. If control-plane-agent were to receive a request to set power state while Wicket was updating the host flash, I don't see any evidence of an interlock that would prevent it from dutifully obeying. Currently, this seems like the most likely scenario: that something else tried to power the host on from the management network. Perhaps some control plane component with lots of disk space has logs?

control-plane-agent itself contains a ringbuf where it records things, but from the overwrite generation number (48,815) and system uptime (77,760 seconds) we can see that it's been destroying its ringbuf contents, on average, every 1.593 seconds since poweron. This causes the ringbuf to not be very useful, IMO, and I have some thoughts below.

Thoughts on things we might want to change in the SP firmware and/or host boot software:

  1. If the host determines that the flash is totally hosed, should it request some sort of "fatal reboot" so that we don't just sit there and cycle, and maybe raise some kind of fault to the control plane? Because rebooting isn't going to fix this. I don't believe we currently have an affordance for this in the protocol. (Edit: The host totally already does this, we're just ignoring it; see below.)
  2. I would like to see counters for significant events, instead of or alongside ringbufs. Ringbufs are useful but it's too easy to overwrite them quickly. (They're also relatively costly in terms of CPU cycles and RAM.)
  3. I would like to see tasks like control-plane-agent separate their ringbufs into, basically, "error/warning" vs "debug/trace" levels. The "important stuff" ringbuf would be written to far less often (not, for example, logging all UDP packets) and could persist for much longer without burning too much RAM.
  4. I think we should have an interlock between transitions in power state and either update state or host flash mux status; there's a rough sketch of the idea after this list. It doesn't make sense that the SP was willing to power the host on during the host flash update.
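
A minimal sketch of the interlock idea in (4), under the assumption that the check lives in the sequencer's power-state path. None of this is the real gimlet-seq code; the types and names below just mirror concepts from this thread:

    // Sketch only: refuse to power the host on while the host flash is
    // muxed to the SP (i.e. while an update is in flight). `PowerState`,
    // `HfMuxState`, and `SeqError` stand in for whatever the real
    // sequencer and host-flash types are.
    #[derive(Clone, Copy, PartialEq)]
    enum PowerState {
        A2,
        A0,
    }

    #[derive(Clone, Copy, PartialEq)]
    enum HfMuxState {
        Sp,
        HostCpu,
    }

    #[derive(Debug)]
    enum SeqError {
        MuxedToSp,
    }

    fn check_power_state_change(
        requested: PowerState,
        mux: HfMuxState,
    ) -> Result<(), SeqError> {
        // Powering the host on while the SP owns the flash would let it
        // try to boot a half-written phase1 image, which is roughly what
        // happened here.
        if requested == PowerState::A0 && mux == HfMuxState::Sp {
            return Err(SeqError::MuxedToSp);
        }
        Ok(())
    }

Doing the check in the sequencer (rather than in control-plane-agent) would cover every caller, since all other paths to A0 go through its set_state operation.
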
cbiffle commented 7 months ago

@wesolows has helped me understand the IPCC behavior here, and notes that the host is sending a very specific reboot reason, which we are (mostly) ignoring.

In this ringbuf line:

   1  690    13754        1 Request { now: 0x4ace2e6, sequence: 0x6, message: HostBootFailure { reason: 0x2 } }

...the 0x2 means IPCC_BOOTFAIL_NOPHASE2, which precisely describes the issue. The problem is, it's only going to the ringbuf. The host-sp-comms handling loop actively discards the information:

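            // The `{ .. }` pattern below discards the `reason` field, so
            // the boot-failure reason goes no further than the ringbuf.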
            HostToSp::HostBootFailure { .. } => {

I'll edit my post above to note that suggestion (1) has in fact been in place the whole time, but we need to implement support for it on the SP.

From a chat discussion, we seem to be broadly agreed that the SP lacks the context to know whether to flip the mux in response to this boot failure reason. In a production case, falling back to a previous good image may or may not be a good idea -- Wicket specifically is attempting to do a recovery image boot, in which case you really don't want to flip back to some arbitrary image. So the right thing to do at the SP is probably:

  1. Notice this reason.
  2. Stop boot-looping the poor host.
  3. Raise an alarm to the control plane.

We don't currently have a "raise an alarm to the control plane" facility that I'm aware of, other than the Ignition fault pin -- which really isn't intended for this. We shouldn't use Ignition's fault line for anything that doesn't stop the management network from coming up.

So, as I file follow-on issues from this, I'll note the need for something like that.
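
For concreteness, here's a rough sketch of what the three steps above might look like in the host-sp-comms handler that currently discards the reason. This is not the real code: the constant value 0x2 comes from the ringbuf above, and both stop_reboot_timer and raise_control_plane_alarm are made-up names for facilities that largely don't exist yet:

    // Sketch only: notice the reason, stop boot-looping the host, and
    // surface the fault instead of silently retrying forever.
    const IPCC_BOOTFAIL_NOPHASE2: u8 = 0x2;

    fn handle_host_boot_failure(reason: u8) {
        match reason {
            IPCC_BOOTFAIL_NOPHASE2 => {
                // The host has no usable phase2; rebooting won't fix it.
                stop_reboot_timer();
                raise_control_plane_alarm(reason);
            }
            _ => {
                // Other boot failures keep today's reboot behavior.
            }
        }
    }

    fn stop_reboot_timer() { /* placeholder */ }
    fn raise_control_plane_alarm(_reason: u8) { /* placeholder */ }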

cbiffle commented 7 months ago

See the above-referenced bugs for my follow-on issues. I don't think that the behavior of the server was SP-initiated, so I'm not sure there's specific work to be done on Hubris components in response to it -- I'm happy to leave this issue open if you'd prefer, but otherwise I'd probably close it out. Thoughts?

jgallagher commented 7 months ago

Closing it works for me. We're trying to track down whether something in the control plane could've sent the request to go to A0 mid-update.