oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0

Power shelf advanced monitoring #874

Open mkeeter opened 1 year ago

mkeeter commented 1 year ago

Once we can log power shelf data (#873), we'll want more advanced monitoring, e.g. to report a rectifier or fan failure.

It will probably go through control-plane-agent, but is pending a broader discussion about logging strategies.

cbiffle commented 4 months ago

Had some rectifiers fault in the colo today, which currently requires a manual intervention, so I'm appropriating this issue report to start tracking how to fix that.

Behavior that would be helpful right now that is not yet implemented includes:

Information we'd probably need to gather to do this:

Likely implementation chunks, not all of which are in this repo, seem like:

As an intermediate option, we could add a task similar to the gimlet-inspector to allow only access to the FRAM blob over the technician port. This could let us collect the fault data without blocking on control plane support.

cbiffle commented 4 months ago

Murata PMBus appnote with register addressing and protocol details, for the record, is here: https://www.murata.com/-/media/webrenewal/products/power/appnote/acan-114.ashx

cbiffle commented 4 months ago

FRAM datasheet: https://www.mouser.com/datasheet/2/1113/MB85RS64T_DS501_00051_2v0_E-2329177.pdf

isobering commented 4 months ago

A decision on what the rectifier fault recovery and retry logic should be in a lights-out situation. (It may be different for a -lab image or something.)

In the near term, the fault recovery procedure should probably be something like:

In the long term, I think the fault recovery behavior should be operator configurable, and it would be good to give them three options for recovering from rectifier faults:

Info on how to do the recovery process on the rectifier.

My current belief is that the only way to recover a rectifier is to toggle its SP_TO_PS_PSU_x_EN_L signal. I have a vague memory of Mark Lerner and/or @ericaasen testing this, but it would have occurred before I joined Oxide!

cbiffle commented 4 months ago

SP_TO_PS_PSU_x_EN_L

@isobering Just to make sure I'm doing the right thing here -- the PSC revC rev9 schematic has no nets of that name. I assume the ones in question here are SP_TO_PS_PSU_ON_x_L?

cbiffle commented 4 months ago

On review of the schematic, we are also interested in noticing and exposing changes in the PS_TO_SP_PSU_PRESENT_x_L nets that detect removal/insertion of the power supplies.
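Noticing those presence changes could look roughly like the sketch below. It assumes the six PS_TO_SP_PSU_PRESENT_x_L lines are sampled into the low six bits of a byte (bit N for slot N, 0 meaning present because the nets are active-low). The bit packing, the `PresenceEvent` type, and the function name are all hypothetical, and a real Hubris task would likely avoid the `Vec` allocation.

```rust
/// Number of rectifier slots on the shelf (the thread mentions 6 rectifiers).
const SLOTS: usize = 6;

/// Hypothetical event type for a change on one presence line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PresenceEvent {
    Inserted(usize),
    Removed(usize),
}

/// Compares two raw samples of the active-low presence lines and returns the
/// per-slot changes. Assumed packing: bit N = slot N, 0 = present.
fn presence_changes(prev_raw: u8, now_raw: u8) -> Vec<PresenceEvent> {
    let mask = (1u8 << SLOTS) - 1;
    // Invert so that a 1 bit means "present", then keep only the valid slots.
    let prev = !prev_raw & mask;
    let now = !now_raw & mask;
    let changed = prev ^ now;

    (0..SLOTS)
        .filter(|&i| (changed & (1u8 << i)) != 0)
        .map(|i| {
            if (now & (1u8 << i)) != 0 {
                PresenceEvent::Inserted(i)
            } else {
                PresenceEvent::Removed(i)
            }
        })
        .collect()
}
```

Diffing against the previous sample means only edges get reported, so a supply that simply stays absent does not generate repeated events.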

rmustacc commented 4 months ago

Which pieces of data from the rectifier should be logged in the FRAM, and in particular, any information specific to a fault diagnosis that we may not already be sending to sensors.

The minimum viable piece here that is useful is going to be starting with the standard PMBus alerting-related register, STATUS_WORD, which then refers to the other registers:

Note that STATUS_CML is used for a number of different failures. Things like unsupported/invalid commands or data are likely cases I wouldn't log, whereas I would log the memory and processor failures. The device has two rails: the primary 54.5V and the 12V standby. If we had to focus on only a single rail for some reason it would want to be the 54.5V, though it's possible that the others. If I'm being picky, these STATUS words are to me the most interesting thing we can log, as other data would hopefully end up in sensors and related.
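As a rough sketch of that STATUS_WORD-first approach: the summary bits and command codes below are the standard PMBus ones, but whether this particular rectifier implements every detail register is an assumption to check against the Murata appnote. The function names and the table are hypothetical, and the logging filter only encodes the "memory and processor failures" preference stated above.

```rust
// Standard PMBus command codes for the detail status registers.
const STATUS_VOUT: u8 = 0x7A;
const STATUS_IOUT: u8 = 0x7B;
const STATUS_INPUT: u8 = 0x7C;
const STATUS_TEMPERATURE: u8 = 0x7D;
const STATUS_CML: u8 = 0x7E;
const STATUS_OTHER: u8 = 0x7F;
const STATUS_MFR_SPECIFIC: u8 = 0x80;
const STATUS_FANS_1_2: u8 = 0x81;

/// (STATUS_WORD summary bit, detail register to read when that bit is set).
const SUMMARY_TO_DETAIL: [(u16, u8); 8] = [
    (1 << 15, STATUS_VOUT),
    (1 << 14, STATUS_IOUT),
    (1 << 13, STATUS_INPUT),
    (1 << 12, STATUS_MFR_SPECIFIC),
    (1 << 10, STATUS_FANS_1_2),
    (1 << 9, STATUS_OTHER),
    (1 << 2, STATUS_TEMPERATURE),
    (1 << 1, STATUS_CML),
];

/// Returns the detail registers worth reading for a given STATUS_WORD value.
fn detail_registers(status_word: u16) -> Vec<u8> {
    SUMMARY_TO_DETAIL
        .iter()
        .filter(|(bit, _)| (status_word & bit) != 0)
        .map(|&(_, reg)| reg)
        .collect()
}

// STATUS_CML bits for the failures singled out above as worth logging.
const CML_MEMORY_FAULT: u8 = 1 << 4;
const CML_PROCESSOR_FAULT: u8 = 1 << 3;

/// Hypothetical logging filter: ignore invalid command/data noise and keep
/// only the memory and processor fault bits.
fn cml_worth_logging(status_cml: u8) -> bool {
    (status_cml & (CML_MEMORY_FAULT | CML_PROCESSOR_FAULT)) != 0
}
```

Driving the reads from STATUS_WORD keeps the bus traffic on a fault down to the registers that actually have something to say.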

The other thing for us to consider here is that for everything other than INPUT_UV_F there is actually a black box that the controller generates, which contains this as well as the readings that triggered it. It can store up to 5 of these. Perhaps grabbing and storing this on a fault is actually what we should consider, rather than a manual set.

A decision on what the rectifier fault recovery and retry logic should be in a lights-out situation. (It may be different for a -lab image or something.)

So, I think the interesting thing to me is that the PSC will survive on its 12V standby even if the 54.5V main stays up. To me, this suggests that our starting point should not be on a per-rectifier basis; rather, if all 6 rectifiers go down, we should pretty much always try to recover right now. I think that's not the worst starting point. Obviously there are a lot of different ways we can go over time.
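That starting policy is small enough to write down. The sketch below is only one reading of it: the `RectifierState` type and the function name are hypothetical, and treating empty slots as "not supplying power" (so a partially populated shelf with every installed rectifier faulted still counts as all down) is an assumption, not something decided in this thread.

```rust
const RECTIFIERS: usize = 6;

/// Hypothetical per-slot state as the PSC might track it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RectifierState {
    Absent,
    Ok,
    Faulted,
}

/// Starting-point policy: attempt recovery only when no rectifier is
/// delivering power and at least one of them is faulted.
fn should_auto_recover(states: &[RectifierState; RECTIFIERS]) -> bool {
    let any_ok = states.iter().any(|s| *s == RectifierState::Ok);
    let any_faulted = states.iter().any(|s| *s == RectifierState::Faulted);
    !any_ok && any_faulted
}
```

A single faulted rectifier with the rest healthy returns `false` here, which leaves the per-rectifier decision to whatever policy gets settled on later.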

In the long term, I think the fault recovery behavior should be operator configurable, and it would be good to give them three options for recovering from rectifier faults:

I'm not sure how much we want this to be a per-rectifier decision and a per-site thing. I think we'll want to work through the different fault cases and figure out when we want to offline a single rectifier due to predictive failure. That offline decision is probably not something the PSC should make on its own, as the question of what we should do will ultimately depend a lot on the load, what's been set here, and related.

Info on how to do the recovery process on the rectifier.

§10.7 'Clearing a shutdown due to a fault' from the PMBus spec has the canonical way to do this that should work here. One thing we should be careful with here is that I expect the signal we have will control both 12V and 54.5V power, whereas the OPERATION command may only impact the faulted rail. Based on experimental data it is only the 54.5V rail that has faulted.
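For the enable-signal route, the off-then-on sequence might be sketched as follows. Everything here is hypothetical: the `EnableLine` trait stands in for whatever GPIO access the PSC task ends up with, the caller supplies the delay, and the off time is a placeholder that would have to come from the rectifier documentation. Per the caveat above, this path may drop both rails, not just the faulted one.

```rust
/// Hypothetical abstraction over one rectifier's enable net. `enabled = true`
/// means "rectifier on"; the implementation handles the active-low polarity.
trait EnableLine {
    fn set_enabled(&mut self, enabled: bool);
}

/// Clears a latched fault by turning the rectifier off, waiting, and turning
/// it back on. `off_time_ms` is an assumed parameter, not a datasheet value.
fn power_cycle<E: EnableLine>(
    line: &mut E,
    mut delay_ms: impl FnMut(u32),
    off_time_ms: u32,
) {
    line.set_enabled(false);
    delay_ms(off_time_ms);
    line.set_enabled(true);
}

/// Test double that records every write, for checking the sequence on a host.
#[derive(Default)]
struct RecordingLine {
    writes: Vec<bool>,
}

impl EnableLine for RecordingLine {
    fn set_enabled(&mut self, enabled: bool) {
        self.writes.push(enabled);
    }
}
```

Keeping the pin and the delay behind parameters means the same sequence can be exercised off-target before anyone tries it on a live shelf.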