openbmc / phosphor-pid-control

OpenBMC PID-based Thermal Control Daemon
Apache License 2.0
18 stars 21 forks source link

Startup behaviour if sensors missing is inconsistent and unreliable #33

Open Krellan opened 1 year ago

Krellan commented 1 year ago

Currently, there are a number of bugs in swampd that affect the behavior of the program upon startup, if any of the configured sensors are missing.

This came up in the discussion of https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/61545 and it has been mentioned previous times also, an example being https://github.com/openbmc/phosphor-pid-control/issues/31 for MissingIsAcceptable (which provides only a partial solution).

Here's some of the misfeatures:

Here's something I propose:

Krellan commented 1 year ago

Here's some more background.

There are known issues with phosphor-pid-control and missing sensors at the start of the run.

The failsafe feature only covers sensors that exist at the start of the run, but then start reporting bad values later during the run.

The failsafe feature also only covers bad values. The sensor must still be there, to report the bad values. If the sensor goes missing during the run, that exposes undefined behavior.

The stability of sensor existence is really a challenge that needs to be handled below the layer of phosphor-pid-control, as it is more of an issue for dbus-sensors and entity-manager to manage.

However, there should be better defined behavior in phosphor-pid-control if a configured sensor is missing at the start of the run, hence this feature request.

As of now, the behavior is different between old config.json and new entity-manager code paths! That's a bug.

The correct behavior of phosphor-pid-control is only defined if the sensors exist at the start of the run, and the sensors remain in existence throughout the entire run.

If sensors are added or deleted during the run, that is UB (undefined behavior). Even if it is just a quick delete-then-add-again, during an initialization or recovery code path, that is still UB. Sometimes it will work correctly, and sometimes it will not, and it cannot be relied upon, because it's UB.

It would be desirable to fix this, I believe, however, are there any users out there now who depend on certain quirks of this undefined behavior? Please make it known now, otherwise this UB is fair game to be changed, in an effort to clean it up and make it better defined.