giampaolo / psutil

Cross-platform lib for process and system monitoring in Python
BSD 3-Clause "New" or "Revised" License
10.19k stars 1.38k forks source link

[Linux] sensors_temperatures wrong when using thermal_zones #2439

Open siena-sam opened 1 week ago

siena-sam commented 1 week ago

Summary

Description

Quick Description

On systems without /sys/class/hwmon/hwmon*/temp*_* to determine the temperature psutil will fallback to /sys/class/thermal/thermal_zone*. In the code that gets the temperature from the thermal_zone files, however, it fails to reset the high and critical values it is tracking for each trip point such that the value can be divided by 1000 multiple times resulting in an incorrect value. Somehow, the final value returned and the possible value generated in each iteration of the loop need to be tracked separately so that future iterations do not alter the same value multiple times.

Example Scenario

Take the following scenario (which we are actively facing). A device has 2 trip points configured: /sys/class/thermal/thermal_zone0/trip_point_0_* and /sys/class/thermal/thermal_zone0/trip_point_1_*. trip_point_1_type is critical and trip_point_0_type is passive.

In the following snippet of code (taken from https://github.com/giampaolo/psutil/blob/release-5.9.8/psutil/_pslinux.py#L1457) a list of all trip points are gathered and iterated through. The variables high and critical are set to None only before the iteration through all trip points, not on each iteration. If it iterates through trip point 1 and then trip point 0 it will first correctly get the critical temperature from /sys/class/thermal/thermal_zone0/trip_point_1_temp and divide it by 1000 since critical is not None. On the second iteration, however, it will fail to update either high or critical because the type is passive. high will still be None and critical will still be the previous, correct value from the first iteration. On line 1481 critical will have a value (instead of None) so it will convert it to a float and divide by 1000.0 again. Thus the final value returned will be incorrect as it was divided by 1000 an extra time. If there are more trip points configured this value would be divided even more times resulting in possibly very small values. We have a different instance with 5 trip points where the configured critical threshold was 95 but the value returned for critical and high was 9.5e-11 (95000/1000 5 times).

image

Related Issues

The issue https://github.com/giampaolo/psutil/issues/1873 reported by https://github.com/lior-ll also demonstrates this bug in that when they force it to use thermal_zones the returned value is 110_000/1000 6 times (because there are 6 temperature trips configured) thus returning 1.1e-13 instead of the expected 110. I did not add to that ticket as it focused on wanting to use thermal_zones even if hwmon is present on the system while this ticket focuses on an issue with the thermal_zone temperature calculation.

siena-sam commented 1 week ago

Here is a proposed fix ready for review: https://github.com/giampaolo/psutil/pull/2440

siena-sam commented 1 week ago

Here is a proposed fix ready for review: https://github.com/giampaolo/psutil/pull/2440