openbmc / phosphor-pid-control

OpenBMC PID-based Thermal Control Daemon
Apache License 2.0
16 stars 21 forks source link

the sensors reading on plugable boards issue (valid sensors detection method proposal) #25

Open huangalang opened 1 year ago

huangalang commented 1 year ago

As we found that we have to define sensor reading path in config.json Then buildSensors() function will search all the sensor reading paths defined in config.json Once they are found , they are added to SensorManager (mgmr.addSensor()) (https://github.com/openbmc/phosphor-pid-control/blob/master/sensors/builder.cpp#L184)

Once pid service started , the added sensors reading are updated if the property changed =>void DbusPidZone::updateSensors(void)

However, buildSensors() process will fail when any of the defined sensor reading paths is not found. It will restart over and over

take following case as example : the sensors on a pluggable board The Sensors path need to be predefined in config.json But they wont exist , if the board is not plugged It will let buildSensors() process repeat over and over => (https://github.com/openbmc/phosphor-pid-control/blob/master/sensors/builder.cpp#L99)

We’d like propose a method called valid sensor detection The main idea is to create validSensors map recording if the sensor reading path is found within a certain time, Only the valid ones will be added to sensor manager (https://github.com/openbmc/phosphor-pid-control/blob/master/sensors/builder.cpp#L184)

And this validSensors map will be passed on to buildZones() process

Because only the valid sensors can be added to thermal_inputs of each zone (we use this map to identify the valid sensors) Through this way , only the valid sensors reading will be updated =>

=========================================

void DbusPidZone::updateSensors(void) { for (const auto& t : _thermalInputs) <== only the valid sensors in _thermalInputs { auto sensor = _mgr.getSensor(t); <=== only the valid sensors in mgr ReadReturn r = sensor->read(); }

}

please share your ideas, we are open to further discussion

Krellan commented 1 year ago

Here's my thoughts:

1) It looks like a bug, if buildSensors() is in a tight restart loop. It should not be in a restart loop.

Expected behavior is that it should set all zones with missing sensors to failsafe mode, and continue loading. Zones without missing sensors should function normally. Zones with missing sensors should remain in failsafe mode.

If all of the missing sensors appear at a later time, and the zone is no longer missing any sensors, the zone should come out of failsafe mode, and begin normal operation. I'm hoping the program receives notifications when new sensors are added, so that it doesn't have to be manually restarted in order to search for sensors again. If not, that's a bug.

2) We should have a "missing is acceptable" flag, for each sensor. Default to false.

If we simply filter out all missing sensors, then we run the risk of overheating the system if an important sensor is missing. We should set the system to failsafe mode, instead. Failsafe mode is designed to be used if any sensors are missing, or not reporting valid data.

Some unimportant sensors may be designated as optional. Those, I'm OK with filtering them out in the updateSensors() loop, as you described. This designation would be a new feature, so that it would not impact existing sensors already in the field.

Alternatively, a compile-time constant could be used, to change all sensors from mandatory to optional, defaulting to false. This would most likely be much easier to implement. By defaulting to false, it would not affect existing systems in the field, so it would not compromise their ability to go into failsafe mode when sensors are missing.

harveyquta commented 1 year ago

I'm not sure if it's a bug or not. When I was using static json before, all of the sensors in the "inputs" list need to exist on dbus, or the service will start to fail. But in dbusconfiguration, it is acceptable that the sensors in "inputs" list don't exist on dbus.

Krellan commented 1 year ago

A pluggable board should trigger some auto-detection, perhaps using entity-manager JSON probe/exposes files, so that the PID loops are only instantiated when the board is plugged in.

There is no easy solution for this case. Failsafe mode is bad, because there's no need to have the fans spin fast, to cool a board that isn't there. Ignoring missing sensors is bad, because the board could overheat if any sensors on it go bad and disappear.

As for static JSON versus D-Bus, that is a strange behavior. It sounds like a bug to me. It would be great if we could address both of the code paths and clean up any differences in behavior like this, that we find.

zevweiss commented 1 year ago

We should have a "missing is acceptable" flag, for each sensor. Default to false.

Isn't that essentially what "unavailableAsFailed": false in the config provides? (Albeit with an inverted boolean sense.)

As for static JSON versus D-Bus, that is a strange behavior.

Sounds reminiscent of this...

huangalang commented 1 year ago

@zevweiss
as we know this flag unavailableAsFailed": false , only works for the case when the sensors are unavaible on the fly , ex : cpu temp sensors is unavalible when host is powered off in this case sensor reading path is obtained at buildSensors() , so this unavailableAsFailed will be set while creating dbus passive sensor object (https://github.com/openbmc/phosphor-pid-control/blob/master/dbus/dbuspassive.cpp#L83)

but the sensors we are talking about is the sensors on plugable board they will not be found since the path wont be there from the begining, so the dbuspassive sensor wont be created, and buildSensors() will restart over and over (https://github.com/openbmc/phosphor-pid-control/blob/master/dbus/dbuspassive.cpp#L73)

huangalang commented 1 year ago

@Krellan , as you suggested , missing sensors should be devided into two types 1 unacceptable ones, in this case , we should keep the original behaviour , let it restart when it's not found

  1. acceptable ones , in this case we may go for the method we proposed =>only add valid ones to _thermalInput by distinguishing these two types , we may use the flag you adviced "missing is acceptable"
huangalang commented 1 year ago

harveyquta

how did you get that result? our experiments shows buildSensors() basically fails, when the sensor path are not found (the board is not plugged)

harveyquta commented 1 year ago

When using phosphor-hwmon + statis JSON before, I added DIMM sensors to an input list.

"inputs": [ "dimm0","dimm1","dimm2","dimm3","dimm4","dimm5","dimm6","dimm7","dimm8","dimm9","dimm10","dimm11","dimm12","dimm13","dimm14","dimm15"],

But when one of dimm sensors in this list doesn't exist on dbus(didn't insert DIMM device on board), pid service will start to fail and restart.

huangalang commented 1 year ago

@harveyquta we use enity-manager +dbus-sensors with phosphor-pid-control , so it's the different story