openbmc / phosphor-pid-control

OpenBMC PID-based Thermal Control Daemon
Apache License 2.0
16 stars 21 forks source link

Allow objects (sensors, loops, zones) to be designated as non-essential and quietly skipped over if missing #31

Open Krellan opened 1 year ago

Krellan commented 1 year ago

Here's a feature request I am seeing come in from the field.

We need to allow objects to be designated as non-essential.

This will allow them to be quietly skipped over during parsing of the configuration. They are designated as non-essential, so if they are missing, that is acceptable. They will be quietly skipped over, as if they never existed.

I propose adding this flag to input sensors, PID loops, and thermal zones.

What should the syntax be? It has to be something that will work both for the old-style (directly consuming the config.json file) and the new-style (Inventory objects on D-Bus created by entity-manager from JSON files) configuration paths.

This means that if we decorate the elements of the sensors array (the sensor objects) in the old-style JSON, we also need to think of a different way to do this for the new-style configuration path, because we can't decorate sensor objects in the new-style (because they are controlled by entity-manager, not us). However, we could augment the PID loop that uses them, like what was done for TempToMargin to flag certain sensors as being eligible for temperature-to-margin conversion.

PID loops and thermal zones need to also have a flag in this way. There was one attempt to add a catch-all flags parameter, but this patch ran into problems during review, because it's difficult/bureaucratic to make any changes to an established interface. Should we bring this feature back? https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/41802

Interestingly, the thermal zone parameters were successfully augmented later, to add a few more optional parameters to control timing, and interestingly, there was no corresponding upstream objection then: https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/57660

Anyway, once we have agreement on the mechanism, and the flags are added for sensors, loops, and zones, here's the behavioral changes that will result:

How's this? Anything missing here?

Krellan commented 1 year ago

Some thoughts.

There might already be an existing Availability interface we can use. However, this is only for D-Bus sensor objects. It does not extend to PID loop objects, nor to thermal zone objects.

Also, what should the name of this feature be? Unimportant? Nonessential? MissingIsOK? MissingIsAcceptable?

Krellan commented 1 year ago

I like using the name MissingIsAcceptable.

Also, this is a trait from PID control's point of view, not a trait of the sensor itself, so it should not be in the Sensor interface. For this and other reasons, this rules out using the existing Availability flag.

How about this:

In a PID loop (Entity Manager type Pid), how about this:

{
  "Name": "NecessaryTempPidLoop",
  "Type": "Pid",
  ...
  "Inputs": [ "FlakyTemp1", "DependableTemp1", "FlakyTemp2", "DependableTemp2" ],
  ...
  "MissingIsAcceptable": [ "FlakyTemp1", "FlakyTemp2" ],
  ...
}

In this example, there are 4 temperature inputs. Of these, 2 are flaky, and 2 are dependable. The flaky sensors are listed in MissingIsAcceptable so if something happens to them (disappeared, reading unparseable, reading is a floating-point irrational such as NaN, any other error), they will be simply ignored. They will not contribute to the PID loop calculations, as if they never existed.

If FlakyTemp11 or FlakyTemp2 are NaN or missing, nothing happens. If DependableTemp1 or DependableTemp2 are NaN or missing, however, the entire PID loop output will be set to NaN, to correctly propagate the error upward.

This concept extends to thermal zones as well:

{
  "Type": "Pid.Zone",
  "Name": "MyCoolZone",
  "Inputs": [ "NecessaryTempPidLoop", "OptionalTempPidLoop", "MyFanPidLoop" ],
  ...
  "MissingIsAcceptable": [ "OptionalTempPidLoop" ],
  ...
}

In this example, NecessaryTempPidLoop will cause the entire zone to be thrown into failsafe if the output of this PID loop is NaN, as expected. However, because OptionalTempPidLoop is listed in the MissingIsAcceptable array, it is treated as optional, so if it is NaN or any other error happens with it, that's no big deal, and the thermal zone will just behave as if that PID loop never existed.

The use case is to allow some optional sensors and PID loops to help react faster to thermal spikes, and keep the system cool, without having to run the main fans as often. However, the mandatory sensors and PID loops will be there as a backup. Even though they are slower to react, they are enough to keep the system cool, albeit at the cost of having to run the fans more often than is strictly necessary. So, this system is not in any danger of overheating if we omit the optional stuff, but it really is nice to able to have the optional sensors working, as it saves power and runs quieter.

Krellan commented 9 months ago

The MissingIsAcceptable feature has been merged. This should provide most, but not all, of what we want. There are still a few edge cases.