Open Krellan opened 1 year ago
Some thoughts.
There might already be an existing Availability
interface we can use. However, this is only for D-Bus sensor objects. It does not extend to PID loop objects, nor to thermal zone objects.
Also, what should the name of this feature be? Unimportant
? Nonessential
? MissingIsOK
? MissingIsAcceptable
?
I like using the name MissingIsAcceptable
.
Also, this is a trait from PID control's point of view, not a trait of the sensor itself, so it should not be in the Sensor
interface. For this and other reasons, this rules out using the existing Availability
flag.
How about this:
In a PID loop (Entity Manager type Pid
), how about this:
{
"Name": "NecessaryTempPidLoop",
"Type": "Pid",
...
"Inputs": [ "FlakyTemp1", "DependableTemp1", "FlakyTemp2", "DependableTemp2" ],
...
"MissingIsAcceptable": [ "FlakyTemp1", "FlakyTemp2" ],
...
}
In this example, there are 4 temperature inputs. Of these, 2 are flaky, and 2 are dependable. The flaky sensors are listed in MissingIsAcceptable
so if something happens to them (disappeared, reading unparseable, reading is a floating-point irrational such as NaN
, any other error), they will be simply ignored. They will not contribute to the PID loop calculations, as if they never existed.
If FlakyTemp1
1 or FlakyTemp2
are NaN
or missing, nothing happens. If DependableTemp1
or DependableTemp2
are NaN
or missing, however, the entire PID loop output will be set to NaN
, to correctly propagate the error upward.
This concept extends to thermal zones as well:
{
"Type": "Pid.Zone",
"Name": "MyCoolZone",
"Inputs": [ "NecessaryTempPidLoop", "OptionalTempPidLoop", "MyFanPidLoop" ],
...
"MissingIsAcceptable": [ "OptionalTempPidLoop" ],
...
}
In this example, NecessaryTempPidLoop
will cause the entire zone to be thrown into failsafe if the output of this PID loop is NaN
, as expected. However, because OptionalTempPidLoop
is listed in the MissingIsAcceptable
array, it is treated as optional, so if it is NaN
or any other error happens with it, that's no big deal, and the thermal zone will just behave as if that PID loop never existed.
The use case is to allow some optional sensors and PID loops to help react faster to thermal spikes, and keep the system cool, without having to run the main fans as often. However, the mandatory sensors and PID loops will be there as a backup. Even though they are slower to react, they are enough to keep the system cool, albeit at the cost of having to run the fans more often than is strictly necessary. So, this system is not in any danger of overheating if we omit the optional stuff, but it really is nice to able to have the optional sensors working, as it saves power and runs quieter.
The MissingIsAcceptable
feature has been merged. This should provide most, but not all, of what we want. There are still a few edge cases.
Here's a feature request I am seeing come in from the field.
We need to allow objects to be designated as non-essential.
This will allow them to be quietly skipped over during parsing of the configuration. They are designated as non-essential, so if they are missing, that is acceptable. They will be quietly skipped over, as if they never existed.
I propose adding this flag to input sensors, PID loops, and thermal zones.
What should the syntax be? It has to be something that will work both for the old-style (directly consuming the
config.json
file) and the new-style (Inventory objects on D-Bus created byentity-manager
from JSON files) configuration paths.This means that if we decorate the elements of the
sensors
array (the sensor objects) in the old-style JSON, we also need to think of a different way to do this for the new-style configuration path, because we can't decorate sensor objects in the new-style (because they are controlled byentity-manager
, not us). However, we could augment the PID loop that uses them, like what was done forTempToMargin
to flag certain sensors as being eligible for temperature-to-margin conversion.PID loops and thermal zones need to also have a flag in this way. There was one attempt to add a catch-all
flags
parameter, but this patch ran into problems during review, because it's difficult/bureaucratic to make any changes to an established interface. Should we bring this feature back? https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/41802Interestingly, the thermal zone parameters were successfully augmented later, to add a few more optional parameters to control timing, and interestingly, there was no corresponding upstream objection then: https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/57660
Anyway, once we have agreement on the mechanism, and the flags are added for sensors, loops, and zones, here's the behavioral changes that will result:
If sensor is flagged as non-essential, and it isn't found during loading, it will be silently skipped over, as if it never existed in the configuration.
If sensor is flagged as non-essential, and it fails during reading, or returns a
NaN
value during reading, this error will be ignored, and the sensor's contribution to PID loop input will be ignored, as if the sensor never existed.If a PID loop has no inputs at all, because all of the inputs were non-essential and they weren't there, the zone will still be thrown into failsafe mode, unless the PID loop itself is also designated as non-essential.
If a PID loop is designated as non-essential, if it is unable to produce an output value for any reason (all input sensors are missing or failed), then this PID loop will be ignored, as if it never existed in the configuration.
If a zone is flagged as non-essential, it will not result in any errors if it fails. No fan output values will be written, if there was a problem with the zone. As with other objects, it will simply be ignored, as if it never existed.
How's this? Anything missing here?