ibm-openbmc / dev

Product Development Project Mgmt and Tracking

Hot adapter temp sensor fan control loop integration support #1668

Closed derekhoward55 closed 2 years ago

derekhoward55 commented 4 years ago

Hot adapter support. No GPU temps. New pldm PDRs/fru records for phyp to send the list of device/vendor ids to the BMC.

The BMC has a table that associates each device/vendor id (and ccin for future usage) with either a device address and register to read, or a "hot adapter table" indication (this was in the pcie-power.xml file in the mrw). If the device id is not in the table, ignore it. If the device id is in the table and the entry says which i2c register to read, read it via the i2c device driver to get the temp, and integrate that into the fan control algorithm.

If the entry has the "hot adapter table" indication, use the higher hot adapter floor (the floor is also based on ambient, maybe altitude, maybe power save mode).
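A minimal sketch of the table lookup described above, in Python for brevity. The table contents, the ids, the bus/address/register values, and the names `I2cSensor`, `HotAdapter`, `ADAPTER_TABLE`, and `plan` are all hypothetical, not taken from pcie-power.xml or any existing BMC code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class I2cSensor:
    """Table entry for an adapter whose temp can be read over i2c."""
    bus: int
    address: int
    register: int


@dataclass(frozen=True)
class HotAdapter:
    """Table entry with the "hot adapter table" indication (no readable temp)."""


# Hypothetical table keyed by (vendor_id, device_id); all values are made up.
ADAPTER_TABLE = {
    (0x1234, 0x0001): I2cSensor(bus=3, address=0x4C, register=0x00),
    (0x1234, 0x0002): HotAdapter(),
}


def plan(adapters):
    """Split the ids reported by the host into sensors to read and a floor flag.

    Returns (sensors, use_hot_floor). Ids that are not in the table are ignored.
    """
    sensors = []
    use_hot_floor = False
    for vendor_id, device_id in adapters:
        entry = ADAPTER_TABLE.get((vendor_id, device_id))
        if entry is None:
            continue  # not in the table: ignore
        if isinstance(entry, I2cSensor):
            sensors.append(entry)  # temp feeds the fan control algorithm
        else:
            use_hot_floor = True  # no temp available: raise the floor instead
    return sensors, use_hot_floor
```

Here a single hot adapter is enough to select the higher floor; whether the floor should also scale with the number of hot adapters is not stated in this issue.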

The info from the def file and pcie-power.xml file is read from the fan control events.json config file and the entity manager config files.

Need floors, t_controls, error limits, etc. for adapters as well, from the power management def file.
Proposed new location for data from def file:

  1. Fan floor for the given pcie-cooling-type: store in /fans/phosphor-fan-control-events-config/witherspoon/events.json
  2. t_control (high&low) for card type (temperature value at which to speed up fans): store in /fans/phosphor-fan-control-events-config/witherspoon/events.json
  3. t_inc, t_drop for card type (rpm amount to inc/dec fans): store in /fans/phosphor-fan-control-events-config/witherspoon/events.json
  4. error (temp value at which to log error): store in /dbus/thermal-policy/ibm-ac-server/thermal-policy.json
  5. sample time (how often to read sensor): Used in hwmon. configurable?
  6. sample error count (# of fails before raise fans, log error): TBD
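As a rough illustration of how items 1 to 3 could interact, here is a Python sketch of one control step. It assumes that the high t_control is the temp at which fans speed up by t_inc and the low t_control is the temp at which they slow down by t_drop; that reading, the ceiling parameter, and the names `CardLimits` and `next_target` are assumptions, and this is not the phosphor-fan-control implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardLimits:
    """Per-card-type values from the list above (units are assumed)."""
    t_control_low: float   # deg C, at or below this the fans may slow down
    t_control_high: float  # deg C, at or above this the fans speed up
    t_inc: int             # rpm added per step when hot
    t_drop: int            # rpm removed per step when cool


def next_target(current_rpm, temp_c, limits, floor_rpm, ceiling_rpm):
    """Return the next fan target, clamped between the floor and the ceiling."""
    if temp_c >= limits.t_control_high:
        target = current_rpm + limits.t_inc
    elif temp_c <= limits.t_control_low:
        target = current_rpm - limits.t_drop
    else:
        target = current_rpm  # inside the band: hold speed
    return max(floor_rpm, min(ceiling_rpm, target))
```

The floor passed in would be the one chosen for the pcie-cooling-type, so a hot adapter raises the lower clamp without changing the step logic.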

Ensure the DD for sensors supports a failure property. Ensure hot-swap is handled appropriately.
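The sample error count (number of failed reads before raising fans and logging an error) is still TBD, but one possible shape is a per-sensor counter like the Python sketch below. It assumes the failures must be consecutive and that a good read clears the fault; both choices, and the name `ReadFailureTracker`, are assumptions rather than anything decided here.

```python
class ReadFailureTracker:
    """Counts consecutive failed sensor reads against a threshold."""

    def __init__(self, max_failures):
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures
        self.consecutive = 0
        self.faulted = False

    def record(self, reading):
        """Record one sample; a reading of None means the read failed.

        Returns True only on the sample that first reaches the threshold,
        which is when the caller would raise fans and log an error.
        """
        if reading is not None:
            self.consecutive = 0
            self.faulted = False  # assumption: a good read clears the fault
            return False
        self.consecutive += 1
        if self.consecutive >= self.max_failures and not self.faulted:
            self.faulted = True
            return True
        return False
```

A hot-swap removal would presumably drop the tracker for that slot rather than let it count toward a fault.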

A later enhancement could be to get the ccin from the bmc instead of the device id from phyp. However, windows exist where temps aren't readable; they are only guaranteed to be readable after phyp powers the adapters on and sends the ids to the bmc. Also, hot adapter support is still needed for adapters whose temps we can't read via i2c.

Put this info, power management def file info, and pcie-power.xml in LLDD?

Note:

Not all pcie cards have present detect. Most don't.

The BMC can't collect vpd for every pcie card for which phyp can get the vendor/device/etc IDs. For most it can't, only for the ones for which IBM has a thermal sensor. A higher floor may still be needed even if there is no vpd.

The slots do not have to be turned on by phyp to be able to read temps. VPD and temp sensors are on sb power. Windows exist before phyp powers them on where temps aren't readable.

TBD if ccin is as specific/unique as vendor/device/etc ID.

In general, if we can't support this, or nvme drives, or pcie switches, etc., we have to raise the default floors.

spinler commented 4 years ago

The epic to put the PCIE device properties on D-Bus using PLDM is #2625.

spinler commented 2 years ago

All stories complete. Closing.