Closed bnaecker closed 1 month ago
Whoops, this one's on me. I guess we ought to have the MGS metrics task special-case sensor readings with the name "CPU". I'll look into fixing that.
A quick fix would be to just change MGS' sensors task to special-case temperature readings with the name "CPU" and device kind "sbtsi", which I took a first pass at in https://github.com/oxidecomputer/omicron/commit/9c502de315229fd5094dd0d373ae10528d60859c. In the long run, though, I think it would be more ideologically correct to change the gateway-sp-comms
protocol and add a new MeasurementKind
variant for synthetic, dimensionless (so-called) measurements. That's probably a bigger change as it involves plumbing that through Hubris in various places.
gateway_sp_comms
protocol, so that other tools (like faux-mgs) will also not see Tctl values as degrees Celsius. But, that should probably be tracked on the Hubris or management-gateway-service
repos, rather than in Omicron.
Today, we report the CPU temperature via the SP, in the
hardware_component:temperature
timeseries, specifically those with the"CPU"
sensor name. Like all the other temperature measurements, we report those in degrees C. Unfortunately, that's not correct for the CPU temperature metrics the SP has.There's a lot of subtlety here, which Robert and Keith laid out in this stlouis block comment and this issue. The high-level summary is that these values are a "control temperature", a synthetic number that AMD computes in an unknown way on the basis of the actual hardware sensors a given CPU has. The point of this appears to be to provide a single value that is generic across all the different parts and resource groups (classes of CPUs with different TDP, for example). That number can be used in a thermal control loop, rather than the requiring the loop to figure out which part it's reading from and combine all the junction temperatures itself. (Those actually are in degrees C.)
The best thing to do here is probably cordon off the CPU temperature into its own timeseries. We definitely need to make clear that this is dimensionless, in both the actual unit and the description of the timeseries values as well.