Power shelf advanced monitoring

mkeeter commented 1 year ago

Once we can log power shelf data (#873), we'll want more advanced monitoring, e.g. to report a rectifier or fan failure.

It will probably go through control-plane-agent, but is pending a broader discussion about logging strategies.

cbiffle commented 4 months ago

Had some rectifiers fault in the colo today, which currently requires a manual intervention, so I'm appropriating this issue report to start tracking how to fix that.

Behavior that would be helpful right now that is not yet implemented includes:

The ability to read the "what happened" information out of the rectifier, including, at minimum, the fault/alert information accessible over PMBus.
Recording that information to the FRAM so that it is accessible across a power cycle, since it can't come out via the management network if the power supply is off.
Control logic to at least attempt to recover the rectifier and turn power back on. (This is a rare case in the system where the microcontroller really truly does need to be responsible for this class of decision, because the Bigger Computers may not be powered on, due to the rectifier trip).

Information we'd probably need to gather to do this:

Which pieces of data from the rectifier should be logged in the FRAM, and in particular, any information specific to a fault diagnosis that we may not already be sending to sensors.
A decision on what the rectifier fault recovery and retry logic should be in a lights-out situation. (It may be different for a -lab image or something.)
Info on how to do the recovery process on the rectifier.

Likely implementation chunks, not all of which are in this repo, seem like:

Detection of rectifier faults in the PSC firmware and/or extension of existing detection to write more information into the FRAM.
FRAM support in the PSC firmware (I don't see existing support, though they tend to be software-compatible with 24xx series I2C EEPROMs, so it's possible we're using it already with that driver)
Rectifier recovery support in the PSC firmware.
Extensions to control plane agent to report the availability of an FRAM blob up-stack, and to allow it to be extracted and then marked as unneeded.
Extensions to the control plane to notice and collect said blob and put it somewhere we can get to it.

As an intermediate option, we could add a task similar to the gimlet-inspector to allow only access to the FRAM blob over the technician port. This could let us collect the fault data without blocking on control plane support.
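
To make the FRAM blob idea concrete, here is a minimal sketch of a fixed-size fault record with a "collected" flag, so that something up-stack can extract it and then mark it as unneeded. Everything in it is an assumption for illustration: the `FaultRecord` name, the 16-byte layout, the magic byte, and the XOR checksum (a real implementation would probably want a CRC) are not taken from the PSC firmware.

```rust
/// Hypothetical on-FRAM record size; not an existing Hubris format.
pub const RECORD_LEN: usize = 16;
/// Hypothetical marker so that blank (all-zero) FRAM never decodes as a record.
const MAGIC: u8 = 0xA5;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FaultRecord {
    /// Index of the rectifier that faulted.
    pub psu: u8,
    /// Set once the record has been extracted up-stack and is no longer needed.
    pub collected: bool,
    /// Monotonic sequence number, to order records across power cycles.
    pub seq: u16,
    /// Raw PMBus STATUS_WORD at the time of the fault.
    pub status_word: u16,
    /// Raw bytes of the seven STATUS_* sub-registers, in a fixed (assumed) order.
    pub status_regs: [u8; 7],
}

/// XOR of all bytes; cheap, and enough to catch a torn write in this sketch.
fn checksum(bytes: &[u8]) -> u8 {
    bytes.iter().fold(0u8, |acc, b| acc ^ b)
}

impl FaultRecord {
    /// Serialize to the fixed layout: magic, psu, flags, seq, status_word,
    /// status_regs, one reserved byte, checksum.
    pub fn encode(&self) -> [u8; RECORD_LEN] {
        let mut out = [0u8; RECORD_LEN];
        out[0] = MAGIC;
        out[1] = self.psu;
        out[2] = self.collected as u8;
        out[3..5].copy_from_slice(&self.seq.to_le_bytes());
        out[5..7].copy_from_slice(&self.status_word.to_le_bytes());
        out[7..14].copy_from_slice(&self.status_regs);
        // out[14] is reserved and left as zero.
        out[15] = checksum(&out[..15]);
        out
    }

    /// Returns None for blank or corrupted storage.
    pub fn decode(raw: &[u8; RECORD_LEN]) -> Option<Self> {
        if raw[0] != MAGIC || checksum(&raw[..15]) != raw[15] {
            return None;
        }
        let mut status_regs = [0u8; 7];
        status_regs.copy_from_slice(&raw[7..14]);
        Some(FaultRecord {
            psu: raw[1],
            collected: raw[2] != 0,
            seq: u16::from_le_bytes([raw[3], raw[4]]),
            status_word: u16::from_le_bytes([raw[5], raw[6]]),
            status_regs,
        })
    }
}
```

A fixed-size record keeps the FRAM addressing trivial (record n lives at n * RECORD_LEN) and lets a technician-port task dump records without understanding their contents.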

cbiffle commented 4 months ago

Murata PMBus appnote with register addressing and protocol details, for the record, is here: https://www.murata.com/-/media/webrenewal/products/power/appnote/acan-114.ashx

cbiffle commented 4 months ago

FRAM datasheet: https://www.mouser.com/datasheet/2/1113/MB85RS64T_DS501_00051_2v0_E-2329177.pdf

isobering commented 4 months ago

A decision on what the rectifier fault recovery and retry logic should be in a lights-out situation. (It may be different for a -lab image or something.)

In the near term, the fault recovery procedure should probably be something like:

Detect a fault (either by monitoring the PWR_OK_L signals or by detecting a PMBus ALERT signal and reading the fault)
Assert a high level on SP_TO_PS_PSU_x_EN_L to disable the rectifier
Dump whatever error information you want over PMBus and write it to the FRAM
Wait a small integer number of seconds for transient power line stuff to clear
Assert a low level on SP_TO_PS_PSU_x_EN_L to enable the rectifier
If the rectifier is still faulted after a small integer number of attempts (say, three attempts), turn the rectifier off and do not keep trying to recover - there's probably something wrong with the rectifier.
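
The near-term procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Rectifier` trait and its methods are hypothetical stand-ins for whatever GPIO, PMBus, FRAM, and timer access the PSC firmware actually has, and a real implementation would likely need a settling delay between re-enabling the rectifier and checking for a fault.

```rust
/// Hypothetical hardware hooks; none of these names come from the PSC firmware.
pub trait Rectifier {
    /// True while a fault is visible (e.g. via PWR_OK_L or a PMBus ALERT).
    fn is_faulted(&mut self) -> bool;
    /// Drive the active-low enable net: `true` enables (net low),
    /// `false` disables (net high).
    fn set_enabled(&mut self, on: bool);
    /// Read the fault information over PMBus and persist it (e.g. to FRAM).
    fn log_fault(&mut self);
    /// Block for the given number of seconds.
    fn wait_secs(&mut self, secs: u32);
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// No fault was present, so nothing was done.
    Healthy,
    /// The fault cleared after this many disable/enable cycles.
    Recovered { attempts: u32 },
    /// Still faulted after the last attempt; the rectifier is left disabled.
    GaveUp,
}

/// Disable, log, wait, re-enable; repeat up to `max_attempts` times.
pub fn recover<R: Rectifier>(r: &mut R, max_attempts: u32, settle_secs: u32) -> Outcome {
    if !r.is_faulted() {
        return Outcome::Healthy;
    }
    for attempt in 1..=max_attempts {
        r.set_enabled(false);
        r.log_fault();
        r.wait_secs(settle_secs);
        r.set_enabled(true);
        if !r.is_faulted() {
            return Outcome::Recovered { attempts: attempt };
        }
    }
    // Probably something wrong with the rectifier: turn it off and stop trying.
    r.set_enabled(false);
    Outcome::GaveUp
}
```

Calling `recover(&mut psu, 3, 5)` would correspond to "three attempts, a few seconds apart"; both numbers are placeholders, not decided values.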

In the long term, I think the fault recovery behavior should be operator configurable, and it would be good to give them three options for recovering from rectifier faults:

Keep the rectifier off and do not retry
Retry [n] times, where the operator specifies the number of attempts [n]
Retry continuously every [t] seconds, where the operator specifies the time [t]
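
As a rough sketch, those three options map naturally onto an enum. The `RetryPolicy` type and `next_retry` helper below are hypothetical names for illustration, and the default delay used for the bounded case is an assumption, since only the continuous option specifies a time.

```rust
/// Hypothetical operator-selectable policy; not an existing Hubris type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryPolicy {
    /// Keep the rectifier off and do not retry.
    Never,
    /// Retry up to `attempts` times, then give up.
    Limited { attempts: u32 },
    /// Retry forever, once every `interval_secs` seconds.
    Forever { interval_secs: u32 },
}

impl RetryPolicy {
    /// Given how many retries have already been made, returns the delay in
    /// seconds before the next retry, or None if no further retry is allowed.
    pub fn next_retry(&self, retries_so_far: u32, default_delay_secs: u32) -> Option<u32> {
        match *self {
            RetryPolicy::Never => None,
            RetryPolicy::Limited { attempts } if retries_so_far < attempts => {
                Some(default_delay_secs)
            }
            RetryPolicy::Limited { .. } => None,
            RetryPolicy::Forever { interval_secs } => Some(interval_secs),
        }
    }
}
```

The near-term behavior (three attempts, then stop) would then just be `RetryPolicy::Limited { attempts: 3 }`.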

Info on how to do the recovery process on the rectifier.

My current belief is that the only way to recover a rectifier is to toggle its SP_TO_PS_PSU_x_EN_L signal. I have a vague memory of Mark Lerner and/or @ericaasen testing this, but it would have occurred before I joined Oxide!

cbiffle commented 4 months ago

SP_TO_PS_PSU_x_EN_L

@isobering Just to make sure I'm doing the right thing here -- the PSC revC rev9 schematic has no nets of that name. I assume the ones in question here are SP_TO_PS_PSU_ON_x_L?

cbiffle commented 4 months ago

On review of the schematic, we are also interested in noticing and exposing changes in the PS_TO_SP_PSU_PRESENT_x_L nets that detect removal/insertion of the power supplies.
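
Noticing insertion and removal amounts to edge detection on the active-low presence nets. A minimal sketch, assuming the six PS_TO_SP_PSU_PRESENT_x_L nets are sampled into the low six bits of a byte with bit i corresponding to PSU i (that packing, and the `PresenceEvent` and `presence_events` names, are made up for illustration; `Vec` is used for brevity and would be replaced in no_std firmware):

```rust
/// Number of rectifier slots in the shelf.
pub const NUM_PSUS: u8 = 6;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PresenceEvent {
    Inserted(u8),
    Removed(u8),
}

/// Compare two raw samples of the presence nets and report what changed.
/// The nets are active low, so a 0 bit means "present".
pub fn presence_events(prev_raw: u8, now_raw: u8) -> Vec<PresenceEvent> {
    let mask = (1u8 << NUM_PSUS) - 1;
    // Invert so that 1 means "present", and ignore the unused upper bits.
    let prev = !prev_raw & mask;
    let now = !now_raw & mask;
    let changed = prev ^ now;
    (0..NUM_PSUS)
        .filter(|&i| changed & (1u8 << i) != 0)
        .map(|i| {
            if now & (1u8 << i) != 0 {
                PresenceEvent::Inserted(i)
            } else {
                PresenceEvent::Removed(i)
            }
        })
        .collect()
}
```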

rmustacc commented 4 months ago

Which pieces of data from the rectifier should be logged in the FRAM, and in particular, any information specific to a fault diagnosis that we may not already be sending to sensors.

The minimum viable piece here that is useful is going to be starting with the standard PMBus alert-related register, STATUS_WORD, which then refers to the other registers:

STATUS_VOUT
STATUS_IOUT
STATUS_INPUT
STATUS_CML
STATUS_TEMPERATURE
STATUS_MFR_SPECIFIC
STATUS_FANS_1_2

Note that STATUS_CML is used for a number of different failures. Things like an unsupported/invalid command or data are likely cases I wouldn't log, whereas the memory and processor failures I would. The device has two rails: the primary 54.5V and the 12V standby. If we had to focus on only a single rail for some reason, it would want to be the 54.5V, though it's possible that the others. If I'm being picky, these STATUS words are to me the most interesting thing we can log, as other data would hopefully end up in sensors and related.
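
A minimal sketch of how STATUS_WORD could drive which sub-registers get read and logged, including the STATUS_CML filtering described above. The command codes and bit positions are written from my recollection of the PMBus spec (Part II) and should be checked against the spec and the Murata appnote before being relied on; the `StatusReg`, `registers_to_read`, and `cml_worth_logging` names are hypothetical. For simplicity this only looks at the summary bits, not the specific low-byte fault bits such as VOUT_OV_FAULT.

```rust
/// PMBus command codes for the status sub-registers (assumed from the spec).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum StatusReg {
    Vout = 0x7A,
    Iout = 0x7B,
    Input = 0x7C,
    Temperature = 0x7D,
    Cml = 0x7E,
    MfrSpecific = 0x80,
    Fans12 = 0x81,
}

/// Map the summary bits of STATUS_WORD to the sub-registers worth reading.
pub fn registers_to_read(status_word: u16) -> Vec<StatusReg> {
    // (summary bit in STATUS_WORD, register it points at)
    const MAP: [(u16, StatusReg); 7] = [
        (1 << 15, StatusReg::Vout),
        (1 << 14, StatusReg::Iout),
        (1 << 13, StatusReg::Input),
        (1 << 12, StatusReg::MfrSpecific),
        (1 << 10, StatusReg::Fans12),
        (1 << 2, StatusReg::Temperature),
        (1 << 1, StatusReg::Cml),
    ];
    let mut out = Vec::new();
    for (bit, reg) in MAP {
        if status_word & bit != 0 {
            out.push(reg);
        }
    }
    out
}

/// Log STATUS_CML only for memory or processor faults, not for
/// invalid/unsupported command or data errors.
pub fn cml_worth_logging(status_cml: u8) -> bool {
    const MEMORY_FAULT: u8 = 1 << 4;
    const PROCESSOR_FAULT: u8 = 1 << 3;
    status_cml & (MEMORY_FAULT | PROCESSOR_FAULT) != 0
}
```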

The other thing for us to consider here is that for everything other than INPUT_UV_F there is actually a black box that the controller generates, which contains this as well as the readings that triggered it. It can store up to 5 of these. Perhaps grabbing and storing this on a fault is actually what we should consider, rather than a manual set.

A decision on what the rectifier fault recovery and retry logic should be in a lights-out situation. (It may be different for a -lab image or something.)

So, I think the interesting thing to me is that the PSC will survive on its 12V standby even if the 54.5V main goes down. To me, what this suggests is that our starting point should not be on a per-rectifier basis; rather, if all 6 rectifiers go down, then we should probably always try to recover right now. I think that's not the worst starting point. Obviously there are a lot of different ways we can go over time.

In the long term, I think the fault recovery behavior should be operator configurable, and it would be good to give them three options for recovering from rectifier faults:

I'm not sure how much we want this to be a per-rectifier decision and a per-site thing. I think we'll want to work through the different fault cases and figure out when we want to offline a single rectifier due to predictive failure. That offline decision is probably not something the PSC should make on its own, as the question of what we should do will ultimately depend a lot on the load, what's been set here, and related.

Info on how to do the recovery process on the rectifier.

§10.7 'Clearing a shutdown due to a fault' from the PMBus spec has the canonical way to do this that should work here. One thing we should be careful with here is that I expect the signal we have will control both 12V and 54.5V power, whereas the OPERATION command may only impact the faulted rail. Based on experimental data, it is only the 54.5V rail that has faulted.
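
For comparison with toggling the enable net, a rough sketch of the OPERATION-based restart: command the output off, clear the latched status bits, then command it back on. The `PmbusWrite` trait is a hypothetical stand-in for a real PMBus driver, the command codes (OPERATION 0x01, CLEAR_FAULTS 0x03) and the 0x00/0x80 off/on values are from my reading of the PMBus spec, and whether the rectifier honors OPERATION at all depends on its ON_OFF_CONFIG, which has not been verified here. Selecting a rail with PAGE is deliberately left out.

```rust
/// Hypothetical minimal PMBus write interface; not the Hubris driver API.
pub trait PmbusWrite {
    type Error;
    /// PMBus "write byte": command code followed by one data byte.
    fn write_byte(&mut self, command: u8, data: u8) -> Result<(), Self::Error>;
    /// PMBus "send byte": command code only.
    fn send_byte(&mut self, command: u8) -> Result<(), Self::Error>;
}

pub const OPERATION: u8 = 0x01;
pub const CLEAR_FAULTS: u8 = 0x03;
/// OPERATION value for "immediate off" (assumed from the spec).
pub const OPERATION_OFF: u8 = 0x00;
/// OPERATION value for "on, no margining" (assumed from the spec).
pub const OPERATION_ON: u8 = 0x80;

/// Cycle the output via OPERATION, clearing latched status bits in between.
/// Stops at the first bus error and returns it.
pub fn restart_via_operation<B: PmbusWrite>(bus: &mut B) -> Result<(), B::Error> {
    bus.write_byte(OPERATION, OPERATION_OFF)?;
    bus.send_byte(CLEAR_FAULTS)?;
    bus.write_byte(OPERATION, OPERATION_ON)?;
    Ok(())
}
```

Any fault information should be read out and logged before calling something like this, since CLEAR_FAULTS discards the status bits.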

oxidecomputer / hubris

Power shelf advanced monitoring #874