oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0

Rack 2 powered off and left blinking power sequencers behind #1800

Closed leftwo closed 2 weeks ago

leftwo commented 5 months ago

It was noticed that rack2 (dogfood) was not responding. This rack has a single power whip connected to it.

https://github.com/oxidecomputer/hubris/assets/9903989/f97ad5bb-cc5e-466a-829a-3c169903290b

The rack had powered off, and three of the sequencers were blinking green.

The PSC was removed and re-inserted into the chassis. From chat, @cbiffle wrote this summary:

Alright, what we know / don't about that PSC behavior:

leftwo commented 5 months ago

Video call and discussion just after the event:
https://drive.google.com/file/d/1kYvCSa2aSQN7wiYk_M-WqpM9zQqTto1m/view
https://drive.google.com/open?id=16SLy5xNv5E_9RvfOjJIGYoFV6jr2p1-8&usp=gmail

Transcript: https://docs.google.com/document/d/1R2Jc3StayZGakRbkOjcEkBSrlbJ8s7Z6C0xrvfLziHY/edit#heading=h.ubrmiriqu6cf

cbiffle commented 5 months ago

Alright, I think we've managed to tease this one out.

When the PSUs hit certain fault conditions they drop their "OK" (active high) line. Up until this month, nobody had written code to actually monitor that line, and we learned of fault conditions in which the PSUs required active intervention to turn back on. In that state they would hang out with an amber light lit.

I added code to the PSC to attempt to cycle the PSUs and clear faults like this, which has been released. Because we don't have a power shelf for testing in EMY, I did all the testing of that change with a hand-wired mockup. It appears my hand-wired mockup got one of the PSU behaviors wrong:

It turns out the PSUs require you to re-enable them before they will stop indicating a fault condition. I had added logic to try to avoid cycling them on and off unnecessarily, which in practice has the effect of never turning them back on in this class of fault condition. We need to change this logic to turn the PSU on and wait a bit before deciding if it's back or not.
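To make the intended change concrete, here is a minimal sketch of "turn the PSU on and wait a bit before deciding if it's back," written as a per-PSU state machine. This is not the actual PSC firmware: the `PsuState` type, the `step` function, and the `OFF_TICKS` / `SETTLE_TICKS` values are all made up for illustration. The point it tries to capture is that the OK line is deliberately ignored while the PSU is held off, since the PSU won't raise it until it has been re-enabled.

```rust
/// Hypothetical per-PSU recovery state (illustrative only, not from the PSC code).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PsuState {
    /// PSU is enabled and its OK line was high on the last sample.
    Running,
    /// PSU is commanded off after a fault; re-enable once the count reaches zero.
    HeldOff { ticks_left: u32 },
    /// PSU has been re-enabled; OK is not judged until the count reaches zero.
    Settling { ticks_left: u32 },
}

/// Assumed hold-off time, in polling ticks, before re-enabling a faulted PSU.
const OFF_TICKS: u32 = 5;
/// Assumed settle time, in polling ticks, after re-enabling before trusting OK.
const SETTLE_TICKS: u32 = 10;

/// Advances the state machine by one polling tick.
///
/// `ok` is the sampled OK line (active high). Returns the next state and
/// whether the enable line should be asserted during that state.
fn step(state: PsuState, ok: bool) -> (PsuState, bool) {
    match state {
        // Healthy: stay enabled.
        PsuState::Running if ok => (PsuState::Running, true),
        // OK dropped: command the PSU off to start a power cycle.
        PsuState::Running => (PsuState::HeldOff { ticks_left: OFF_TICKS }, false),
        // Hold-off expired: re-enable unconditionally. OK is NOT consulted
        // here, because it stays low until the PSU is enabled again.
        PsuState::HeldOff { ticks_left: 0 } => {
            (PsuState::Settling { ticks_left: SETTLE_TICKS }, true)
        }
        PsuState::HeldOff { ticks_left } => {
            (PsuState::HeldOff { ticks_left: ticks_left - 1 }, false)
        }
        // Settle time expired: only now decide whether the PSU came back.
        PsuState::Settling { ticks_left: 0 } => {
            if ok {
                (PsuState::Running, true)
            } else {
                (PsuState::HeldOff { ticks_left: OFF_TICKS }, false)
            }
        }
        PsuState::Settling { ticks_left } => {
            (PsuState::Settling { ticks_left: ticks_left - 1 }, true)
        }
    }
}
```

In this sketch a caller would invoke `step` once per polling interval with the freshly sampled OK line and drive the enable line from the returned flag. A real implementation would presumably also want to bound or back off repeated retries, which is left out here.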

While a PSU is disabled in this manner, it blinks its light green at about 1 Hz. Confusingly, this means "I'm off." This is the signal we've been seeing: it's a sign that the PSC is commanding the PSU off. Due to my misunderstanding of the behavior of the PSU fault signals, the PSC unfortunately never turns the PSU back on.

It turns out that this class of fault condition is relatively easy to reproduce on a lab rack: sneak in via Humility and alter the PSU enable line state. So we have a way to test this in Dogfood now that we have an extender card mounted.

cbiffle commented 2 weeks ago

We're pretty sure this particular failure has been fixed, so I'll close this out.