htop-dev / htop

htop - an interactive process viewer
https://htop.dev/
GNU General Public License v2.0

Temperatures for cores not showing #1284

Open AnOpenSauceDev opened 1 year ago

AnOpenSauceDev commented 1 year ago

When using htop via SSH on my Ubuntu server, I notice that even if I enable "Also show CPU temperature" (libsensors5 is installed), no temperature reading appears. I'm unsure whether this is because of my core count (40 threads total), but no matter what I do, nothing shows up alongside the usage reading.

BenBE commented 1 year ago

What does the full output for sensors -u look like?

AnOpenSauceDev commented 1 year ago

Log File `$ sensors -u`

```
coretemp-isa-0001
Adapter: ISA adapter
Package id 1:
  temp1_input: 38.000
  temp1_max: 85.000
  temp1_crit: 95.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 31.000
  temp2_max: 85.000
  temp2_crit: 95.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 33.000
  temp3_max: 85.000
  temp3_crit: 95.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 31.000
  temp4_max: 85.000
  temp4_crit: 95.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 33.000
  temp5_max: 85.000
  temp5_crit: 95.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 33.000
  temp6_max: 85.000
  temp6_crit: 95.000
  temp6_crit_alarm: 0.000
Core 8:
  temp7_input: 35.000
  temp7_max: 85.000
  temp7_crit: 95.000
  temp7_crit_alarm: 0.000
Core 9:
  temp8_input: 31.000
  temp8_max: 85.000
  temp8_crit: 95.000
  temp8_crit_alarm: 0.000
Core 10:
  temp9_input: 34.000
  temp9_max: 85.000
  temp9_crit: 95.000
  temp9_crit_alarm: 0.000
Core 11:
  temp10_input: 35.000
  temp10_max: 85.000
  temp10_crit: 95.000
  temp10_crit_alarm: 0.000
Core 12:
  temp11_input: 33.000
  temp11_max: 85.000
  temp11_crit: 95.000
  temp11_crit_alarm: 0.000

power_meter-acpi-0
Adapter: ACPI interface
power1:
  power1_average: 136.000
  power1_average_interval: 0.001

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:
  temp1_input: 38.000
  temp1_max: 85.000
  temp1_crit: 95.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 31.000
  temp2_max: 85.000
  temp2_crit: 95.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 34.000
  temp3_max: 85.000
  temp3_crit: 95.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 34.000
  temp4_max: 85.000
  temp4_crit: 95.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 32.000
  temp5_max: 85.000
  temp5_crit: 95.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 37.000
  temp6_max: 85.000
  temp6_crit: 95.000
  temp6_crit_alarm: 0.000
Core 8:
  temp7_input: 37.000
  temp7_max: 85.000
  temp7_crit: 95.000
  temp7_crit_alarm: 0.000
Core 9:
  temp8_input: 36.000
  temp8_max: 85.000
  temp8_crit: 95.000
  temp8_crit_alarm: 0.000
Core 10:
  temp9_input: 38.000
  temp9_max: 85.000
  temp9_crit: 95.000
  temp9_crit_alarm: 0.000
Core 11:
  temp10_input: 37.000
  temp10_max: 85.000
  temp10_crit: 95.000
  temp10_crit_alarm: 0.000
Core 12:
  temp11_input: 35.000
  temp11_max: 85.000
  temp11_crit: 95.000
  temp11_crit_alarm: 0.000

i350bb-pci-0100
Adapter: PCI adapter
loc1:
  temp1_input: 57.000
  temp1_max: 120.000
  temp1_crit: 110.000
```

BenBE commented 1 year ago

Do you know how these sensors are distributed amongst the cores? If I count correctly and assume temperature 0 of each coretemp block to be the overall package temperature I see 24 sensors, which would amount to 48 cores.

@cgzones @fasterit Can you two take a look at this?

AnOpenSauceDev commented 1 year ago

The server setup I have is 2x E5-2680 v2, which should only have two threads per core, so I'm assuming there should only be 20 sensors. Oddly enough, btop detects all sensors fine, which makes me think it could be an htop issue.

SergeyKharenko commented 1 year ago

My server has the same problem:

Motherboard: Supermicro X10 DRi-T
CPU: Dual E5-2698V3

Here is the terminal: image

In sensors -u, the temperature of each core is correct: image

Htop is one of my favorite programs. I would appreciate it if the problem were fixed!

AnOpenSauceDev commented 1 year ago

My problem is that no temperatures show at all, even though I still get valid sensors readings.

BenBE commented 1 year ago

@Kharlenkow :

> In sensors -u, the temperature of each core is correct:

Please provide the output as plain text. While images are fine for pointing at UI issues or conveying what the display looks like, they usually aren't very accessible or easy to process further. Also, your screenshot is missing the interesting part of the sensors -u output.

> Htop is one of my favorite programs.

Glad to hear.

> I would appreciate it if the problem were fixed!

Will have to see if we find a solution to properly process the available information and correlate it with our internal view of the system. This is not the first report regarding CPU sensor stuff – and likely not the last. That stuff is strange at times.

@AnOpenSauceDev Can you provide the full contents of /proc/cpuinfo? It looks kinda strange that core IDs aren't contiguous in the sensors -u output.

Also, if you want to help a bit with investigations: Can you try to establish some kind of mapping of physical cores to the temperature sensor cores by putting some load on individual CPU threads (affinity binding) and checking which temperature follows the load? TIA.
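One way to produce such an affinity-bound load is sketched below. It is a minimal, Linux-only sketch and not part of htop; the helper names `singleCpuMask`, `pinToCpu` and `burn` are made up for illustration. It pins the calling thread to one logical CPU with `sched_setaffinity` and then spins, so that `sensors` can be watched in a second terminal to see which reading rises.

```cpp
#include <sched.h>   // sched_setaffinity, cpu_set_t, CPU_* macros (Linux/glibc)
#include <cassert>
#include <chrono>
#include <cstdint>

// Build an affinity mask that contains exactly one logical CPU.
cpu_set_t singleCpuMask(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return mask;
}

// Pin the calling thread to one logical CPU. Returns false on failure,
// for example when the CPU is offline or excluded by a cpuset.
bool pinToCpu(int cpu) {
    cpu_set_t mask = singleCpuMask(cpu);
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}

// Busy-loop for the given number of seconds to load the current CPU.
// Returns the number of loop iterations performed.
std::uint64_t burn(int seconds) {
    const auto end = std::chrono::steady_clock::now() + std::chrono::seconds(seconds);
    volatile std::uint64_t counter = 0;
    while (std::chrono::steady_clock::now() < end) {
        counter = counter + 1;
    }
    return counter;
}
```

A small `main` could call `pinToCpu(n)` followed by `burn(30)` for each logical CPU in turn. Running any CPU-heavy command under `taskset -c N` should achieve the same without writing code.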

@Kharlenkow In case you have a different CPU, having the same information (cpuinfo, sensors reading, physical<-->sensors mapping) available would be nice.

AnOpenSauceDev commented 1 year ago

> @AnOpenSauceDev Can you provide the full contents of /proc/cpuinfo? It looks kinda strange that core IDs aren't contiguous in the sensors -u output.

cpuinfo.txt

It might take a while to benchmark every core, but so far nothing seems off.

BenBE commented 1 year ago

Thank you for that info. Seems this strange core ID counting is in the CPU info as well. At least makes things consistent. :)

SergeyKharenko commented 1 year ago

@BenBE

> Please provide the output as plain text. While images are fine to point at UI issues or convey what the display looks like, they usually aren't very accessible or easy for further processing. Also, your screenshot is missing the interesting part of the sensors -u output.

Thank you for paying attention to my feedback! Here is the entire output:

coretemp-isa-0001
Adapter: ISA adapter
Package id 1:
  temp1_input: 41.000
  temp1_max: 80.000
  temp1_crit: 98.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 33.000
  temp2_max: 80.000
  temp2_crit: 98.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 33.000
  temp3_max: 80.000
  temp3_crit: 98.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 32.000
  temp4_max: 80.000
  temp4_crit: 98.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 34.000
  temp5_max: 80.000
  temp5_crit: 98.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 34.000
  temp6_max: 80.000
  temp6_crit: 98.000
  temp6_crit_alarm: 0.000
Core 5:
  temp7_input: 34.000
  temp7_max: 80.000
  temp7_crit: 98.000
  temp7_crit_alarm: 0.000
Core 6:
  temp8_input: 33.000
  temp8_max: 80.000
  temp8_crit: 98.000
  temp8_crit_alarm: 0.000
Core 7:
  temp9_input: 32.000
  temp9_max: 80.000
  temp9_crit: 98.000
  temp9_crit_alarm: 0.000
Core 8:
  temp10_input: 35.000
  temp10_max: 80.000
  temp10_crit: 98.000
  temp10_crit_alarm: 0.000
Core 9:
  temp11_input: 34.000
  temp11_max: 80.000
  temp11_crit: 98.000
  temp11_crit_alarm: 0.000
Core 10:
  temp12_input: 31.000
  temp12_max: 80.000
  temp12_crit: 98.000
  temp12_crit_alarm: 0.000
Core 11:
  temp13_input: 35.000
  temp13_max: 80.000
  temp13_crit: 98.000
  temp13_crit_alarm: 0.000
Core 12:
  temp14_input: 31.000
  temp14_max: 80.000
  temp14_crit: 98.000
  temp14_crit_alarm: 0.000
Core 13:
  temp15_input: 35.000
  temp15_max: 80.000
  temp15_crit: 98.000
  temp15_crit_alarm: 0.000
Core 14:
  temp16_input: 33.000
  temp16_max: 80.000
  temp16_crit: 98.000
  temp16_crit_alarm: 0.000
Core 15:
  temp17_input: 34.000
  temp17_max: 80.000
  temp17_crit: 98.000
  temp17_crit_alarm: 0.000

power_meter-acpi-0
Adapter: ACPI interface
power1:
  power1_average: 4294967.295
  power1_average_interval: 1.000

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:
  temp1_input: 41.000
  temp1_max: 80.000
  temp1_crit: 98.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 34.000
  temp2_max: 80.000
  temp2_crit: 98.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 33.000
  temp3_max: 80.000
  temp3_crit: 98.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 35.000
  temp4_max: 80.000
  temp4_crit: 98.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 33.000
  temp5_max: 80.000
  temp5_crit: 98.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 36.000
  temp6_max: 80.000
  temp6_crit: 98.000
  temp6_crit_alarm: 0.000
Core 5:
  temp7_input: 34.000
  temp7_max: 80.000
  temp7_crit: 98.000
  temp7_crit_alarm: 0.000
Core 6:
  temp8_input: 32.000
  temp8_max: 80.000
  temp8_crit: 98.000
  temp8_crit_alarm: 0.000
Core 7:
  temp9_input: 31.000
  temp9_max: 80.000
  temp9_crit: 98.000
  temp9_crit_alarm: 0.000
Core 8:
  temp10_input: 34.000
  temp10_max: 80.000
  temp10_crit: 98.000
  temp10_crit_alarm: 0.000
Core 9:
  temp11_input: 33.000
  temp11_max: 80.000
  temp11_crit: 98.000
  temp11_crit_alarm: 0.000
Core 10:
  temp12_input: 33.000
  temp12_max: 80.000
  temp12_crit: 98.000
  temp12_crit_alarm: 0.000
Core 11:
  temp13_input: 34.000
  temp13_max: 80.000
  temp13_crit: 98.000
  temp13_crit_alarm: 0.000
Core 12:
  temp14_input: 36.000
  temp14_max: 80.000
  temp14_crit: 98.000
  temp14_crit_alarm: 0.000
Core 13:
  temp15_input: 35.000
  temp15_max: 80.000
  temp15_crit: 98.000
  temp15_crit_alarm: 0.000
Core 14:
  temp16_input: 33.000
  temp16_max: 80.000
  temp16_crit: 98.000
  temp16_crit_alarm: 0.000
Core 15:
  temp17_input: 32.000
  temp17_max: 80.000
  temp17_crit: 98.000
  temp17_crit_alarm: 0.000

I am absolutely sure the two CPUs are identical because I personally installed them in the sockets, unless I was cheated by the seller. My /proc/cpuinfo file is here: cpuinfo.txt

BenBE commented 1 year ago

Thank you for the quick feedback.

I did some study of the documentation of the coretemp stuff and it seems the main issue in htop comes down to how the sensors are mapped onto the actual CPU cores. This will likely take a bit of work, as currently the information related to the cpuinfo (and thus core layout) is not kept for correlation in the libsensors code.

Also, the libsensors code assumes the core IDs to be contiguous, which is clearly not the case in the example by @AnOpenSauceDev. The second issue arises with multiple coretemp instances due to multiple CPUs being present in the system. Both issues can be resolved by properly mapping the core IDs of the coretemp instances to the physical CPU cores available from /proc/cpuinfo.
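To make that mapping concrete, here is a rough sketch (not htop code) of how the topology could be read from `/proc/cpuinfo` text. It assumes that the "Package id N" and "Core N" labels of a coretemp instance correspond to the `physical id` and `core id` fields, which the outputs in this thread suggest but do not prove. The names `CpuTopologyEntry`, `parseCpuinfo` and `cpusForCore` are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One logical CPU as described by a block of /proc/cpuinfo on x86 Linux.
struct CpuTopologyEntry {
    int processor = -1;   // logical CPU number ("processor")
    int physicalId = -1;  // package / socket ("physical id")
    int coreId = -1;      // core within the package ("core id"), may have gaps
};

// Parse the text of /proc/cpuinfo into one entry per logical CPU.
std::vector<CpuTopologyEntry> parseCpuinfo(const std::string& text) {
    std::vector<CpuTopologyEntry> result;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        // The key is everything before the colon, minus trailing tabs/spaces.
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);
        const std::string value = line.substr(colon + 1);
        if (key == "processor") {
            result.push_back({});  // a "processor" line starts a new block
            result.back().processor = std::stoi(value);
        } else if (!result.empty() && key == "physical id") {
            result.back().physicalId = std::stoi(value);
        } else if (!result.empty() && key == "core id") {
            result.back().coreId = std::stoi(value);
        }
    }
    return result;
}

// Logical CPUs (hyper-threads) that share the given package and core ID.
std::vector<int> cpusForCore(const std::vector<CpuTopologyEntry>& topo,
                             int physicalId, int coreId) {
    assert(physicalId >= 0 && coreId >= 0);
    std::vector<int> cpus;
    for (const auto& e : topo) {
        if (e.physicalId == physicalId && e.coreId == coreId) cpus.push_back(e.processor);
    }
    return cpus;
}
```

With such a table, a reading labelled "Core 8" on package 0 could be assigned to every logical CPU returned by `cpusForCore(topo, 0, 8)`, regardless of gaps in the core numbering.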

@cgzones Can you please take a look at refactoring the libsensors code? Would be nice if we could implement some proper mapping of sensors to their physical cores.

The heuristic could still remain similar to what it is now, with all cores inheriting Tctrl, then Tdie, followed by Tccd{X}, and only parts of the information cleared out if multiple readings are available for the same core (e.g. acpitz + coretemp). If acpitz gives temperatures for cores not covered by coretemp, those should still keep the acpitz readings.


SergeyKharenko commented 1 year ago

Thanks again for your attention!!!

CAUSES:

I looked at the code in linux/LibSensors.c and created a simple test.

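#include <iostream>
#include <sensors/sensors.h>

int main() {
    // Load the default libsensors configuration.
    if (sensors_init(NULL) != 0) return 1;
    int n = 0;
    // Walk every detected chip and print the names of its features.
    for (const sensors_chip_name* chip = sensors_get_detected_chips(NULL, &n); chip; chip = sensors_get_detected_chips(NULL, &n)) {
        std::cout << "SENSOR:" << chip->prefix << std::endl;
        int m = 0;
        for (const sensors_feature* feature = sensors_get_features(chip, &m); feature; feature = sensors_get_features(chip, &m)) {
            std::cout << "    name " << feature->name << std::endl;
        }
    }
    sensors_cleanup();
    return 0;
}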

Here is the output (run on another dual-socket server with dual Xeon E5-2643 V3, which has the same problem):

SENSOR:coretemp
    name temp1
    name temp2
    name temp3
    name temp4
    name temp5
    name temp6
    name temp7
SENSOR:amdgpu
    name in0
    name fan1
    name temp1
    name power1
SENSOR:i350bb
    name temp1
SENSOR:nvme
    name temp1
SENSOR:coretemp
    name temp1
    name temp2
    name temp3
    name temp4
    name temp5
    name temp6
    name temp7
SENSOR:power_meter
    name power1

There are two SENSORs named coretemp, and they expose exactly the same features. However, line 185 of linux/LibSensors.c reads:

unsigned long int tempID = strtoul(feature->name + strlen("temp"), NULL, 10);

tempID is taken from the number that follows temp. The bug appears when updating the temperatures of the second CPU, because its tempID is wrong. For example, temp1 of the second CPU should be stored at index 6 of the CPU temperature array. As a result, the temperature values of the first CPU are overwritten by those of the second.
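The overwrite can be reproduced without any hardware. The following is a simplified model, not the actual htop implementation: `FakeChip` and `collectBySuffix` are invented stand-ins, and only the `strtoul` expression is taken from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for one detected chip: its prefix plus
// (feature name, reading) pairs such as {"temp2", 31.0}.
struct FakeChip {
    std::string prefix;
    std::vector<std::pair<std::string, double>> temps;
};

// Store readings indexed only by the numeric suffix of "tempN".
// Slots that never receive a reading stay at -1.0.
std::vector<double> collectBySuffix(const std::vector<FakeChip>& chips, std::size_t slots) {
    std::vector<double> data(slots, -1.0);
    for (const auto& chip : chips) {
        if (chip.prefix != "coretemp") continue;  // ignore non-CPU chips
        for (const auto& t : chip.temps) {
            assert(t.first.compare(0, 4, "temp") == 0);
            unsigned long tempID = std::strtoul(t.first.c_str() + std::strlen("temp"), nullptr, 10);
            // Every coretemp instance starts again at temp1, so the second
            // instance writes into the slots already filled by the first.
            if (tempID < slots) data[tempID] = t.second;
        }
    }
    return data;
}
```

Feeding this two coretemp chips that both expose temp1 and temp2 leaves only the readings of the chip enumerated last, which matches the symptom described in this thread.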

SOLUTION

step 0: In Machine.h, add a member CPUsockets to the Machine structure.

step 1: In linux/LinuxMachine.c, add a function 'LinuxMachine_updateCPUsockets' that obtains the value of CPUsockets by reading /sys/devices/system/node/has_cpu. On my machine, for example, the content is '0-1'. I guess on a single-socket system it might be '0-0'.

step 2: In linux/LibSensors.c, add bias = existingCPUs / CPUsockets and int current_CPUsocket = 0 at the beginning of the function LibSensors_getCPUTemperatures. Then change line 185 to: unsigned long int tempID = strtoul(feature->name + strlen("temp"), NULL, 10) + bias * current_CPUsocket; Don't forget to add current_CPUsocket++ after line 211!

step 3: Change lines 256-262 to update the temperatures socket by socket. (I suggest reading the file /sys/devices/system/cpu/smt/active to detect SMT/HT.)

Since I have been a little busy at work recently, I have not implemented this in the original project (I am so sorry). I hope my suggestions will be adopted!

BenBE commented 1 year ago

That's still incomplete, because your solution does not properly track which instance of coretemp is associated with which physical CPU. Overall, it's not as simple as laid out, because you need to track the topology, which is currently unimplemented.

SergeyKharenko commented 8 months ago

Over the past two days, I have studied the source code of the hwmon subsystem and lm-sensors, and tested it with numactl (which can force a task to run on a certain CPU core).

First of all, within each hwmon group, the increasing numeric suffix of tempX as read by lm-sensors corresponds exactly to increasing core IDs (at least on my three machines); I found no out-of-order correspondence like the one you described.

In addition, for multi-socket motherboards, I tested and verified the one-to-one correspondence between the CPU socket ID and nodeX in the system directory (at least this holds on dual-socket motherboards; I don't have motherboards with four or more sockets to test). The addr member of the sensors_chip_name structure in lm-sensors also indicates the actual CPU socket ID. The nodeX sub-folders are located under /sys/devices/system/node/. In each of them, a file named cpulist lists the system CPU IDs that belong to that physical CPU. That settles the question of which CPU socket each system CPU ID belongs to.

To sum up, we could first use the cpulist files to deduce which CPU socket each cpuX entry in /proc/stat belongs to, and then associate each hwmon group with the corresponding CPU socket through the addr member of sensors_chip_name in lm-sensors. Within a single hwmon group, call the lm-sensors API and read the temperature of each core in order.
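The first half of that proposal, turning the cpulist files into a CPU-to-socket table, might look like the sketch below. It works on the file contents passed in as strings, so reading /sys/devices/system/node/nodeX/cpulist is left to the caller. The function names `parseCpuList` and `cpuToNode` are hypothetical. It also relies on the assumption stated in the comment that each nodeX corresponds to one socket, which may not hold on systems that expose more than one NUMA node per package.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Expand a kernel CPU list such as "0-9,20-29" into individual CPU numbers.
std::vector<int> parseCpuList(const std::string& list) {
    std::vector<int> cpus;
    std::istringstream in(list);
    std::string item;
    while (std::getline(in, item, ',')) {
        // Skip leading whitespace and ignore empty items (e.g. a lone newline).
        const auto first = item.find_first_not_of(" \t\n");
        if (first == std::string::npos) continue;
        const auto dash = item.find('-', first);
        const int lo = std::stoi(item.substr(first, dash == std::string::npos ? dash : dash - first));
        const int hi = (dash == std::string::npos) ? lo : std::stoi(item.substr(dash + 1));
        assert(lo <= hi);
        for (int cpu = lo; cpu <= hi; ++cpu) cpus.push_back(cpu);
    }
    return cpus;
}

// Map each logical CPU to its node, given the cpulist contents of
// node0, node1, ... in that order (the vector index is the X of nodeX).
std::map<int, int> cpuToNode(const std::vector<std::string>& nodeCpuLists) {
    std::map<int, int> result;
    for (std::size_t node = 0; node < nodeCpuLists.size(); ++node) {
        for (int cpu : parseCpuList(nodeCpuLists[node])) {
            result[cpu] = static_cast<int>(node);
        }
    }
    return result;
}
```

The resulting table could then be combined with the addr member of each coretemp chip to decide which CPUs a given hwmon group reports on.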

Hope my suggestions would be adopted!

SergeyKharenko commented 8 months ago

@BenBE @AnOpenSauceDev I browsed the pull request list and found that #1352 already does what I wanted.

I also tested the author's fork using the same method. As shown below, the problem has been solved, and each core ID is correctly matched with its temperature. I hope this PR will be accepted!

image