openbmc / phosphor-pid-control

OpenBMC PID-based Thermal Control Daemon
Apache License 2.0
16 stars 21 forks source link

Work around flakiness by periodically issuing redundant write commands to fan #35

Open Krellan opened 1 year ago

Krellan commented 1 year ago

To avoid flooding the D-Bus interface with traffic, there is an optimization within phosphor-pid-control that avoids issuing a write command to the fan, if the commanded PWM is the same value as it was from earlier. In other words, it tries to avoid writing duplicate PWM that would have no effect.

This is a good optimization, as without it D-Bus can be slowed down so much as to be unusable. However, it needs to be overridden in certain situations, such as when returning to automatic PID control from manual, and thus needing all the fans to receive a new PWM command to be sure they return to what the PID algorithm believes they should be at. This was implemented a while ago: https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/36695

However, it appears this optimization might be too aggressive. It would be worth adding another situation in which it should be overridden. Sometimes, the write path is not 100% reliable, and a fan write command can become lost. When this happens, the fan PWM will not receive the command, so it will remain "stuck" at an old PWM, without changing to the new PWM as it should have been commanded to do.

Various OEM have came up with their own patches to allow for this possibility, to avoid the problem with a fan getting "stuck" at the wrong PWM, by allowing fan writes to happen, regardless if they are redundant or not. Here's an example: https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/37357

I am thinking that a reasonable compromise would be to allow a redundant PWM write only once every X cycles through the main loop, or every Y seconds of elapsed time, whatever would be the most reasonable to implement. This would provide enough opportunities to get the PWM command through, without bogging down the rest of the system by spamming D-Bus needlessly.

Krellan commented 10 months ago

This is increasingly important, because I've heard of some fan vendors now featuring hardware watchdogs within their fan controllers!

If software fails to send a command every so often, to pet the watchdog, the fan goes into its own version of failsafe mode, a preset hardwired speed. So, we need to do a redundant write at some fixed interval of time, and make this a standard feature of phosphor-pid-control going forward, with a sensible default (I'm open to suggestions).

Krellan commented 9 months ago

It will also be necessary to audit dbus-sensors and make a similar change there. Reason is, the sensors try to optimize themselves by discarding redundant writes. In this case, however, we intentionally want to write the same value to the hardware, especially if we are trying to pet a watchdog timer.