oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0
2.96k stars 169 forks source link

Gimlet SPs should not report AMD CPU Tctl as a temperature measurement in Celsius #1881

Open hawkw opened 5 days ago

hawkw commented 5 days ago

The CPU temperature value we get from AMD host CPUs over SB-TSI represents the CPU's Tctl, which is not a measurement of physical temperature in Celsius like other temperature sensors. Instead, it's a dimensionless value from 0-100 calculated by the CPU's thermal control loop that represents a sort of "thermal throttling danger level". See oxidecomputer/omicron#6634,oxidecomputer/stlouis#5, or this block comment in stlouis for details on what this value actually means.

Right now, the control-plane-agent task will report Tctl values as a sensor measurement with MeasurementKind::Temperature, which suggests to other software that this is a physical temperature measurement in degrees Celsius. We should change this so that it's clearer that this is a dimensionless measurement. As of oxidecomputer/omicron#6656, the MGS subsystem that collects Oximeter metrics from SP sensor measurements special-cases Tctl based on the sensor name and device ID, so the Oximeter metrics are not recorded with incorrect units. This is probably the most important place to get it right, since Oximeter metrics are the primary way we expect customers to consume SP sensor data.

However, because we are currently special-casing Tctl only in the MGS sensor metrics task, these values will still be treated as physical temperature in Celsius elsewhere --- for instance, in faux-mgs, humility sensors, and to other consumers of the MGS HTTP API for reading SP component details. This is a bit of a bummer, so we should probably fix this in a more principled way.

Doing this in Hubris is probably a bit annoying, as we still want to represent the SB-TSI Tctl value as a "temperature sensor" as far as task-thermal is concerned, as it's an important control parameter for the SP thermal control code...but, we would need to indicate that task-sensors and control-plane-agent should report the value from that particular temperature sensor with different units. Entirely doable, but just requires some plumbing.

While we're here, we probably want to consider changing the Hubris-reported names for this measurement to make it more clear that it's specifically the CPU's Tctl value. Right now, SB-TSI measurements are reported as a component with device: "sbtsi" and description: "CPU temperature sensor", which has one measurement channel with sensor: "CPU" (the Tctl value. We should probably consider changing this to sensor: "Tctl" or sensor: "CPU Tctl" to make this clearer --- especially if we were to support reading other CPU thermal measurements, such as Tdie or Tccdn from Hubris in the future.[^1]

[^1]: Note that, as I understand it, those values can't be read through SB-TSI over SMBus, and would instead need to come over IPCC from the host or something. So, I'm not sure if/when we'll actually expose them to Hubris...