jgallagher closed this issue 7 months ago.
Dug into this a bit, and I think the tl;dr is that we're probably missing the information we'd need to root-cause this.
Here's what I know:
At the time the dump was recorded, the SP had been up for 21.80 hours. So, there was no unexpected power cycle or SP reboot during the wicket update.
At the time the dump was recorded, the sequencer was performing continuous reboots (about once every 84 seconds) under command of host-sp-comms, which will have been a response to the host noticing that everything was hosed, I presume. That part seems like working-as-intended.
Unfortunately, gimlet-seq is chatty enough in its ringbuf in response to commands like that that any prior history of interaction was lost. So, gotta look at the code. Fortunately this appears to have been a production firmware image, meaning it's missing udprpc, so any IPCs in the system will have come from other tasks and not some joker on the network.
gimlet-seq itself will only transition autonomously to A0 in one situation: when it has finished board startup in A2, if it is not built with the stay-in-a2 feature (which is applied to lab machines only). Because gimlet-seq's generation in the dump is zero, it only went through init once, presumably before wicket started updating the flash.
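That one autonomous transition can be sketched as follows. This is NOT the real gimlet-seq code; the names and structure here are purely illustrative of the decision described above.

```rust
// Illustrative sketch only: after A2 board startup completes, lab images
// (built with the stay-in-a2 feature) remain in A2, while production
// images power the host on autonomously. Hypothetical names throughout.
#[derive(Debug, PartialEq)]
enum PowerState {
    A2, // host powered off, SP running
    A0, // host powered on
}

fn state_after_a2_init(stay_in_a2_feature: bool) -> PowerState {
    if stay_in_a2_feature {
        PowerState::A2
    } else {
        PowerState::A0
    }
}
```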
All other paths to A0 in the firmware lead through the gimlet-seq IPC interface's set_state operation. There are three users of that in a Gimlet image, of which only two are capable of setting A0: control-plane-agent and host-sp-comms. (thermal also uses it, but only to emergency-power-off the machine.)
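Summarizing that analysis in code (this is my reading of the prose above, not an actual access-control mechanism in Hubris):

```rust
// My reading of which set_state users are capable of requesting A0.
// A summary of the discussion, not real firmware logic.
fn can_request_a0(task: &str) -> bool {
    match task {
        "control-plane-agent" | "host-sp-comms" => true,
        "thermal" => false, // only uses set_state to emergency-power-off
        _ => false,
    }
}
```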
control-plane-agent will set the power state in response to its (quite appropriately named) set_power_state operation. host-sp-comms will do it in response to requests from the host. As I mentioned above, the latter is definitely happening, because we can see it in the host-sp-comms ringbuf:
```
10 1169 13754 1 SetState { now: 0x4ad00f2, state: A0 }
```
However, this is probably a red herring, because host-sp-comms only does this in response to the timer that manages a host-initiated reboot. We know the host is requesting reboots -- that's why this bug exists. So unless the host somehow requested an initial reboot in A2 -- y'know, when it is unpowered -- host-sp-comms is probably off the hook for the initial problem.
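As a sanity check on that ringbuf entry: if the `now` field counts milliseconds since SP boot (an assumption on my part, but one the numbers support), the timestamp is consistent with the ~21.80-hour uptime recorded at dump time:

```rust
// Convert a ringbuf `now` timestamp to hours, under the assumption that
// it counts milliseconds since SP boot.
fn uptime_hours(now_ms: u64) -> f64 {
    now_ms as f64 / 1000.0 / 3600.0
}
// uptime_hours(0x4ad00f2) is roughly 21.79 hours, matching the dump.
```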
This leaves us with control-plane-agent. If control-plane-agent were to receive a request to set power state while Wicket was updating the host flash, I don't see any evidence of an interlock that would prevent it from dutifully obeying. Currently, this seems like the most likely scenario: that something else tried to power the host on from the management network. Perhaps some control plane component with lots of disk space has logs?
control-plane-agent itself contains a ringbuf where it records things, but from the overwrite generation number (48,815) and system uptime (77,760 seconds) we can see that it has been destroying its ringbuf contents, on average, every 1.593 seconds since poweron. This makes the ringbuf not very useful, IMO, and I have some thoughts below.
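The 1.593-second figure is just the uptime divided by the overwrite generation count:

```rust
// Average interval between complete ringbuf overwrites: total uptime
// divided by the number of overwrite generations seen in the dump.
fn mean_overwrite_interval_s(uptime_s: f64, generations: f64) -> f64 {
    uptime_s / generations
}
// mean_overwrite_interval_s(77_760.0, 48_815.0) is about 1.593 seconds.
```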
Thoughts on things we might want to change in the SP firmware and/or host boot software:

- Have chatty tasks such as control-plane-agent separate their ringbufs into, basically, "error/warning" vs "debug/trace" levels. The "important stuff" ringbuf would be written to far less often (not, for example, logging all UDP packets) and could persist for much longer without burning too much RAM.

@wesolows has helped me understand the IPCC behavior here, and notes that the host is sending a very specific reboot reason, which we are (mostly) ignoring.
In this ringbuf line:
```
1 690 13754 1 Request { now: 0x4ace2e6, sequence: 0x6, message: HostBootFailure { reason: 0x2 } }
```
...the 0x2 means IPCC_BOOTFAIL_NOPHASE2, which precisely describes the issue. The problem is, it's only going to the ringbuf. The host-sp-comms handling loop actively discards the information:
```rust
HostToSp::HostBootFailure { .. } => {
    // The reason byte is matched with `..` and never inspected.
```
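For contrast, a hedged sketch of what decoding the reason instead of discarding it could look like. Only the 0x2 => IPCC_BOOTFAIL_NOPHASE2 mapping comes from this thread; everything else here is hypothetical.

```rust
// Hypothetical decoding of the IPCC boot-failure reason byte. Only the
// 0x2 mapping is taken from this discussion; other values are unknown
// to me and are reported as such.
fn bootfail_reason_name(reason: u8) -> &'static str {
    match reason {
        0x2 => "IPCC_BOOTFAIL_NOPHASE2", // host found no usable phase-2 image
        _ => "unknown",
    }
}
```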
I'll edit my post above to note that suggestion (1) has in fact been in place the whole time, but we need to implement support for it on the SP.
From a chat discussion, we seem to be broadly agreed that the SP lacks the context to know whether to flip the mux in response to this boot failure reason. In a production case, falling back to a previous good image may or may not be a good idea -- Wicket specifically is attempting to do a recovery image boot, in which case you really don't want to flip back to some arbitrary image. So the right thing to do at the SP is probably to report the failure rather than act on it unilaterally.
We don't currently have a "raise an alarm to the control plane" facility that I'm aware of, other than the Ignition fault pin -- which really isn't intended for this. We shouldn't use Ignition's fault line for anything that doesn't stop the management network from coming up.
So, as I file follow-on issues from this, I'll note the need for something like that.
See the above-referenced bugs for my follow-on issues. I don't think that the behavior of the server was SP-initiated, so I'm not sure there's specific work to be done on Hubris components in response to it -- I'm happy to leave this issue open if you'd prefer, but otherwise I'd probably close it out. Thoughts?
Closing it works for me. We're working on seeing if we can track down whether something in the control plane could've sent the request to go to A0 mid-update.
When @augustuswm was mupdating rack3 today, he saw this failure on BRM42220022. The underlying error the SP reported is code=6, which translates to HfError::NotMuxedToSp.
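That translation can be sketched like so. Only the code=6 => NotMuxedToSp mapping is taken from this thread; the enum shape and any other codes are hypothetical.

```rust
// Hypothetical decoding of the host-flash error code reported above.
// Only 6 => NotMuxedToSp comes from this discussion.
#[derive(Debug, PartialEq)]
enum HfError {
    NotMuxedToSp, // host flash is not currently muxed to the SP
    Unknown(u8),
}

fn decode_hf_error(code: u8) -> HfError {
    match code {
        6 => HfError::NotMuxedToSp,
        c => HfError::Unknown(c),
    }
}
```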
We then checked, and this sled was back in A0 and stuck in a reboot loop due to mismatched phase1/phase2 flash contents. We believe the sequence of events was:
Assuming this sequence is correct, we don't know what caused the sled to go back to A0. There's an SP dump in /staff/core/rack3/colo-64/BRM42220022. I don't know the sequencer side of the SP all that well, but I don't see any smoking guns (e.g., there are no tasks with nonzero generations). I'm not sure we'll be able to get anything useful out of ringbufs, because the host being in a reboot loop means it intentionally went through some number of A2 trips. This is very visible in the host_sp_comms ringbuf, which ends with exactly what you'd expect when the host requests a reboot.
It's possible something external sent a message through MGS or faux-mgs to instruct the sled to go back to A0, so it's possible the SP is completely innocent here. But (a) we don't know what that would've been, and (b) we've never seen such a thing before, so I figured it was probably safest to start with an issue here.