oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
251 stars 40 forks source link

CPU temperature timeseries is not in degrees C #6634

Closed bnaecker closed 1 month ago

bnaecker commented 1 month ago

Today, we report the CPU temperature via the SP, in the hardware_component:temperature timeseries, specifically those with the "CPU" sensor name. Like all the other temperature measurements, we report those in degrees C. Unfortunately, that's not correct for the CPU temperature metrics the SP has.

There's a lot of subtlety here, which Robert and Keith laid out in this stlouis block comment and this issue. The high-level summary is that these values are a "control temperature", a synthetic number that AMD computes in an unknown way on the basis of the actual hardware sensors a given CPU has. The point of this appears to be to provide a single value that is generic across all the different parts and resource groups (classes of CPUs with different TDP, for example). That number can be used in a thermal control loop, rather than the requiring the loop to figure out which part it's reading from and combine all the junction temperatures itself. (Those actually are in degrees C.)

The best thing to do here is probably cordon off the CPU temperature into its own timeseries. We definitely need to make clear that this is dimensionless, in both the actual unit and the description of the timeseries values as well.

hawkw commented 1 month ago

Whoops, this one's on me. I guess we ought to have the MGS metrics task special-case sensor readings with the name "CPU". I'll look into fixing that.

hawkw commented 1 month ago

A quick fix would be to just change MGS' sensors task to special-case temperature readings with the name "CPU" and device kind "sbtsi", which I took a first pass at in https://github.com/oxidecomputer/omicron/commit/9c502de315229fd5094dd0d373ae10528d60859c. In the long run, though, I think it would be more ideologically correct to change the gateway-sp-comms protocol and add a new MeasurementKind variant for synthetic, dimensionless (so-called) measurements. That's probably a bigger change as it involves plumbing that through Hubris in various places.

hawkw commented 1 month ago

6656 will fix this by having MGS special-case the SP-reported sensor name and device kind strings that represent Tctl measurements. In the future, I'd like to change how the SP reports these metrics in the gateway_sp_comms protocol, so that other tools (like faux-mgs) will also not see Tctl values as degrees Celsius. But, that should probably be tracked on the Hubris or management-gateway-service repos, rather than in Omicron.