Open AnOpenSauceDev opened 1 year ago
What does the full output of sensors -u look like?
Do you know how these sensors are distributed amongst the cores? If I count correctly and assume temperature 0 of each coretemp block to be the overall package temperature I see 24 sensors, which would amount to 48 cores.
@cgzones @fasterit Can you two take a look at this?
The server setup I have is 2x E5-2680 v2's, which should only be two threads per core. So I'm assuming it should only be 20 sensors. Oddly enough, btop detects all sensors fine, which makes me think it could possibly be an htop issue.
My server has the same problem: Motherboard: Supermicro X10 DRi-T CPU: Dual E5-2698V3 Here is the terminal:
In sensors -u, the temperature of each core is correct:
Htop is one of my favorite programs. I would appreciate it if the problem were fixed!
My problem is that none show at all, but I still have a valid sensors reading.
@Kharlenkow :
In sensors -u, the temperature of each core is correct:
Please provide the output as plain text. While images are fine to point at UI issues or convey what the display looks like, they usually aren't very accessible or easy for further processing. Also, your screenshot is missing the (interesting) part of the sensors -u output.
Htop is one of my favorite programs.
Glad to hear.
I would appreciate it if the problem were fixed!
Will have to see if we find a solution to properly process the available information and correlate it with our internal view of the system. This is not the first report regarding CPU sensor stuff – and likely not the last. That stuff is strange at times.
@AnOpenSauceDev Can you provide the full contents of /proc/cpuinfo? It looks kinda strange that core IDs aren't contiguous in the sensors -u output.
Also, if you want to help a bit with investigations: Can you try to establish some kind of mapping of physical cores to the temperature sensor cores by putting some load on individual CPU threads (affinity binding) and checking which temperature follows the load? TIA.
@Kharlenkow In case you have a different CPU, having the same information (cpuinfo, sensors reading, physical<-->sensors mapping) available would be nice.
@AnOpenSauceDev Can you provide the full contents of /proc/cpuinfo? It looks kinda strange that core IDs aren't contiguous in the sensors -u output.
It might take a while to benchmark every core, but so far nothing seems off.
Thank you for that info. Seems this strange core ID counting is in the CPU info as well. At least makes things consistent. :)
@BenBE
Please provide the output as plain text. While images are fine to point at UI issues or convey what the display looks like, they usually aren't very accessible or easy for further processing. Also Your screenshot is missing (the interesting) part of the sensors -u output:
Thank you for paying attention to my feedback! Here is the entire output:
coretemp-isa-0001
Adapter: ISA adapter
Package id 1:
temp1_input: 41.000
temp1_max: 80.000
temp1_crit: 98.000
temp1_crit_alarm: 0.000
Core 0:
temp2_input: 33.000
temp2_max: 80.000
temp2_crit: 98.000
temp2_crit_alarm: 0.000
Core 1:
temp3_input: 33.000
temp3_max: 80.000
temp3_crit: 98.000
temp3_crit_alarm: 0.000
Core 2:
temp4_input: 32.000
temp4_max: 80.000
temp4_crit: 98.000
temp4_crit_alarm: 0.000
Core 3:
temp5_input: 34.000
temp5_max: 80.000
temp5_crit: 98.000
temp5_crit_alarm: 0.000
Core 4:
temp6_input: 34.000
temp6_max: 80.000
temp6_crit: 98.000
temp6_crit_alarm: 0.000
Core 5:
temp7_input: 34.000
temp7_max: 80.000
temp7_crit: 98.000
temp7_crit_alarm: 0.000
Core 6:
temp8_input: 33.000
temp8_max: 80.000
temp8_crit: 98.000
temp8_crit_alarm: 0.000
Core 7:
temp9_input: 32.000
temp9_max: 80.000
temp9_crit: 98.000
temp9_crit_alarm: 0.000
Core 8:
temp10_input: 35.000
temp10_max: 80.000
temp10_crit: 98.000
temp10_crit_alarm: 0.000
Core 9:
temp11_input: 34.000
temp11_max: 80.000
temp11_crit: 98.000
temp11_crit_alarm: 0.000
Core 10:
temp12_input: 31.000
temp12_max: 80.000
temp12_crit: 98.000
temp12_crit_alarm: 0.000
Core 11:
temp13_input: 35.000
temp13_max: 80.000
temp13_crit: 98.000
temp13_crit_alarm: 0.000
Core 12:
temp14_input: 31.000
temp14_max: 80.000
temp14_crit: 98.000
temp14_crit_alarm: 0.000
Core 13:
temp15_input: 35.000
temp15_max: 80.000
temp15_crit: 98.000
temp15_crit_alarm: 0.000
Core 14:
temp16_input: 33.000
temp16_max: 80.000
temp16_crit: 98.000
temp16_crit_alarm: 0.000
Core 15:
temp17_input: 34.000
temp17_max: 80.000
temp17_crit: 98.000
temp17_crit_alarm: 0.000
power_meter-acpi-0
Adapter: ACPI interface
power1:
power1_average: 4294967.295
power1_average_interval: 1.000
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:
temp1_input: 41.000
temp1_max: 80.000
temp1_crit: 98.000
temp1_crit_alarm: 0.000
Core 0:
temp2_input: 34.000
temp2_max: 80.000
temp2_crit: 98.000
temp2_crit_alarm: 0.000
Core 1:
temp3_input: 33.000
temp3_max: 80.000
temp3_crit: 98.000
temp3_crit_alarm: 0.000
Core 2:
temp4_input: 35.000
temp4_max: 80.000
temp4_crit: 98.000
temp4_crit_alarm: 0.000
Core 3:
temp5_input: 33.000
temp5_max: 80.000
temp5_crit: 98.000
temp5_crit_alarm: 0.000
Core 4:
temp6_input: 36.000
temp6_max: 80.000
temp6_crit: 98.000
temp6_crit_alarm: 0.000
Core 5:
temp7_input: 34.000
temp7_max: 80.000
temp7_crit: 98.000
temp7_crit_alarm: 0.000
Core 6:
temp8_input: 32.000
temp8_max: 80.000
temp8_crit: 98.000
temp8_crit_alarm: 0.000
Core 7:
temp9_input: 31.000
temp9_max: 80.000
temp9_crit: 98.000
temp9_crit_alarm: 0.000
Core 8:
temp10_input: 34.000
temp10_max: 80.000
temp10_crit: 98.000
temp10_crit_alarm: 0.000
Core 9:
temp11_input: 33.000
temp11_max: 80.000
temp11_crit: 98.000
temp11_crit_alarm: 0.000
Core 10:
temp12_input: 33.000
temp12_max: 80.000
temp12_crit: 98.000
temp12_crit_alarm: 0.000
Core 11:
temp13_input: 34.000
temp13_max: 80.000
temp13_crit: 98.000
temp13_crit_alarm: 0.000
Core 12:
temp14_input: 36.000
temp14_max: 80.000
temp14_crit: 98.000
temp14_crit_alarm: 0.000
Core 13:
temp15_input: 35.000
temp15_max: 80.000
temp15_crit: 98.000
temp15_crit_alarm: 0.000
Core 14:
temp16_input: 33.000
temp16_max: 80.000
temp16_crit: 98.000
temp16_crit_alarm: 0.000
Core 15:
temp17_input: 32.000
temp17_max: 80.000
temp17_crit: 98.000
temp17_crit_alarm: 0.000
I am absolutely sure the two CPUs are the same, because I personally installed them in their sockets, unless I was cheated by the seller~
My /proc/cpuinfo file is here: cpuinfo.txt
Thank you for the quick feedback.
I did some study of the documentation of the coretemp stuff and it seems the main issue in htop comes down to how the sensors are mapped onto the actual CPU cores. This will likely take a bit of work, as currently the information related to the cpuinfo (and thus core layout) is not kept for correlation in the libsensors code.
Also, the libsensors code assumes the core IDs to be contiguous, which is clearly not the case with the example by @AnOpenSauceDev. The second issue arises with multiple coretemp instances due to multiple CPUs present in the system. Both issues can be resolved by properly mapping the core IDs of the coretemp instances to the physical CPU cores available from /proc/cpuinfo.
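To make the mapping idea concrete, here is a minimal sketch in C, not htop code, that extracts the three fields such a correlation would need from cpuinfo-formatted text. It assumes the x86 field names processor, physical id and core id; the CpuEntry type and the parse_cpuinfo function are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One logical CPU as described by /proc/cpuinfo (x86 field names assumed). */
typedef struct {
    int processor;   /* logical CPU number, the X in cpuX */
    int physical_id; /* socket / package the CPU belongs to */
    int core_id;     /* core within the package; may have gaps */
} CpuEntry;

/* Parse cpuinfo-formatted text into entries; returns the number stored. */
static size_t parse_cpuinfo(const char* text, CpuEntry* out, size_t max) {
    assert(text && out);
    size_t count = 0;
    CpuEntry cur = { -1, -1, -1 };
    const char* line = text;
    while (*line) {
        int value;
        if (sscanf(line, "processor : %d", &value) == 1) {
            /* A new "processor" line starts a new entry; flush the old one. */
            if (cur.processor >= 0 && count < max) out[count++] = cur;
            cur.processor = value;
            cur.physical_id = -1;
            cur.core_id = -1;
        } else if (sscanf(line, "physical id : %d", &value) == 1) {
            cur.physical_id = value;
        } else if (sscanf(line, "core id : %d", &value) == 1) {
            cur.core_id = value;
        }
        const char* nl = strchr(line, '\n');
        if (!nl) break;
        line = nl + 1;
    }
    /* Flush the last entry, which has no following "processor" line. */
    if (cur.processor >= 0 && count < max) out[count++] = cur;
    return count;
}
```

With entries like these, a coretemp reading labelled "Core N" on package P could be looked up by the pair (physical_id, core_id) rather than by position, so gaps in the core IDs would no longer matter.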
@cgzones Can you please take a look at refactoring the libsensors code? Would be nice if we could implement some proper mapping of sensors to their physical cores.
The heuristic could still remain similar to what it is now, being all cores inherit Tctrl, Tdie followed by Tccd{X}, with only parts of the information cleared out, if multiple readings are available on the same core (e.g. acpitz + coretemp). If acpitz gives temperatures for cores not covered by coretemp, those should still keep the acpitz readings.
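The fallback rule in the last sentence can be sketched as follows. This is only an illustration in C: it assumes per-core arrays in which NAN marks a missing reading, and merge_core_temps is a hypothetical helper, not a function from htop.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Fill gaps in `preferred` (e.g. coretemp) from `fallback` (e.g. acpitz).
 * NAN marks "no reading for this core" in both arrays (assumed convention). */
static void merge_core_temps(double* preferred, const double* fallback, size_t cores) {
    assert(preferred && fallback);
    for (size_t i = 0; i < cores; i++) {
        /* Keep the preferred reading when present; otherwise use the fallback. */
        if (isnan(preferred[i])) preferred[i] = fallback[i];
    }
}
```

Cores covered by both sources keep the preferred value, while cores that only the fallback source reports keep the fallback reading.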
Thanks again for your attention!!!
I referred to the code in linux/LibSensors.c and created a simple test.
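#include <iostream>
#include <sensors/sensors.h>

int main() {
    // Load the default libsensors configuration.
    if (sensors_init(NULL) != 0) return 1;
    int n = 0;
    // Walk every detected chip, then every feature of that chip.
    for (const sensors_chip_name* chip = sensors_get_detected_chips(NULL, &n); chip; chip = sensors_get_detected_chips(NULL, &n)) {
        std::cout << "SENSOR:" << chip->prefix << std::endl;
        int m = 0;
        for (const sensors_feature* feature = sensors_get_features(chip, &m); feature; feature = sensors_get_features(chip, &m)) {
            std::cout << " name " << feature->name << std::endl;
        }
    }
    sensors_cleanup();
    return 0;
}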
Here is the output (it was run on another dual-socket server with dual Xeon E5-2643 V3 CPUs, which has the same problem):
SENSOR:coretemp
name temp1
name temp2
name temp3
name temp4
name temp5
name temp6
name temp7
SENSOR:amdgpu
name in0
name fan1
name temp1
name power1
SENSOR:i350bb
name temp1
SENSOR:nvme
name temp1
SENSOR:coretemp
name temp1
name temp2
name temp3
name temp4
name temp5
name temp6
name temp7
SENSOR:power_meter
name power1
There are two sensors named coretemp, and they expose exactly the same features. However, at line 185 of linux/LibSensors.c:
unsigned long int tempID = strtoul(feature->name + strlen("temp"), NULL, 10);
tempID is taken from the number that follows temp. The bug appears when updating the temperatures of the second CPU, because its tempID values collide with those of the first CPU. For example, temp1 of the second CPU should be stored at index 6 of the CPU temperature array. As a result, the temperature values of the first CPU are overwritten by those of the second.
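The collision can be reproduced without libsensors. The sketch below reuses the strtoul expression from the quoted line; store_reading and the slots array are hypothetical stand-ins for htop's internal storage, used only to show the overwrite.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same parsing as the quoted line: the number after "temp" becomes the slot. */
static unsigned long temp_id(const char* feature_name) {
    assert(strncmp(feature_name, "temp", 4) == 0);
    return strtoul(feature_name + strlen("temp"), NULL, 10);
}

/* Hypothetical stand-in for storing one reading into a shared array.
 * Nothing here depends on which coretemp chip the feature came from. */
static void store_reading(double* slots, size_t n, const char* feature_name, double value) {
    unsigned long id = temp_id(feature_name);
    if (id < n) slots[id] = value;
}
```

Storing temp1 from the first coretemp chip and then temp1 from the second writes to the same slot twice, so only the second package's value survives.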
step 0:
In Machine.h, add a CPUsockets field to the Machine structure.
step 1:
In linux/LinuxMachine.c, add a function 'LinuxMachine_updateCPUsockets' that gets the value of CPUsockets by reading /sys/devices/system/node/has_cpu. Taking my machine as an example, the content is '0-1'. I guess that on a single-socket system it might be '0-0'.
step 2:
In linux/LibSensors.c, add bias=existingCPUs/CPUsockets and int current_CPUsocket=0 at the beginning of the function LibSensors_getCPUTemperatures. Also change line 185 to:
unsigned long int tempID = strtoul(feature->name + strlen("temp"), NULL, 10)+bias*current_CPUsocket;
Don't forget to add current_CPUsocket++ after line 211!
step 3:
Change lines 256-262 so that the temperatures are updated socket by socket. (I suggest reading the file /sys/devices/system/cpu/smt/active to detect SMT/HT.)
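The arithmetic behind steps 1 and 2 can be sketched as below. This is an illustration of the proposal only, not a patch: count_list_ids and socket_slot are hypothetical helpers. Two assumptions are worth stating loudly. The kernel's list format normally prints a single ID as "0" rather than "0-0", so the parser accepts both. And has_cpu lists NUMA nodes, which may not equal the socket count on systems that expose more than one node per package.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the IDs in a kernel-style list such as "0-1", "0" or "0-3,8-11". */
static unsigned int count_list_ids(const char* list) {
    assert(list);
    unsigned int count = 0;
    const char* p = list;
    while (*p >= '0' && *p <= '9') {
        char* end;
        unsigned long first = strtoul(p, &end, 10);
        unsigned long last = first;
        if (*end == '-') last = strtoul(end + 1, &end, 10);
        if (last >= first) count += (unsigned int)(last - first + 1);
        /* Continue after a comma; anything else (newline, NUL) ends the list. */
        p = (*end == ',') ? end + 1 : end;
    }
    return count;
}

/* Slot for a feature number on a given socket, following step 2:
 * bias = existingCPUs / CPUsockets, slot = tempID + bias * socket. */
static unsigned long socket_slot(unsigned long temp_id, unsigned int cpus,
                                 unsigned int sockets, unsigned int socket) {
    unsigned long bias = sockets ? cpus / sockets : 0;
    return temp_id + bias * socket;
}
```

For example, with 40 CPUs and a has_cpu content of "0-1", the bias is 20, so temp1 of the second coretemp chip would land in slot 21 instead of slot 1. The scheme still relies on the chips being enumerated in socket order.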
Since I have been a little busy at work recently, I have not implemented the code in the original project (I am so sorry TOT). I hope my suggestions will be adopted!
That's still incomplete, because your solution does not properly track which instance of coretemp is associated with which physical CPU. Overall it's not as simple as laid out, because you need to track the topology, which is currently unimplemented.
In the past two days, I have consulted the source code of the hwmon subsystem and lm-sensors, and tested it with numactl (which is able to force the task to run on a certain CPU core).
First of all, within each hwmon group the tempX suffixes read by lm-sensors increase in exactly the same order as the core IDs (at least on my three machines); I found no out-of-order correspondence like the one you described.
In addition, for multi-socket motherboards, I tested and verified the one-to-one correspondence between the CPU socket ID and nodeX in the system directory (at least this is true on dual-socket motherboards; I don't have boards with four or more sockets to test). In fact, the addr field of the sensors_chip_name structure in lm-sensors also indicates the actual CPU socket ID. The nodeX sub-folders are located under /sys/devices/system/node/. In each of them, a file named cpulist lists the system CPU IDs that belong to that physical CPU. That completely resolves which CPU socket each system CPU ID belongs to.
To sum up, we could first use the cpulist files to deduce which CPU socket a core belongs to, based on the X suffix of cpuX in /proc/stat, and then associate each hwmon group with its CPU socket through the addr field of sensors_chip_name in lm-sensors. Within a single hwmon group, call the lm-sensors API and read the temperature of each core in order.
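The first half of that summary, going from a cpuX number to a node, might look like the following sketch. It assumes the cpulist contents have already been read into strings, one per nodeX directory; cpulist_contains and node_of_cpu are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Return 1 if `cpu` appears in a cpulist string such as "0-7,16-23". */
static int cpulist_contains(const char* list, unsigned long cpu) {
    assert(list);
    const char* p = list;
    while (*p >= '0' && *p <= '9') {
        char* end;
        unsigned long first = strtoul(p, &end, 10);
        unsigned long last = first;
        if (*end == '-') last = strtoul(end + 1, &end, 10);
        if (cpu >= first && cpu <= last) return 1;
        p = (*end == ',') ? end + 1 : end;
    }
    return 0;
}

/* Index of the node whose cpulist holds `cpu`, or -1 if none does.
 * cpulists[i] is assumed to hold the content of nodeI/cpulist. */
static int node_of_cpu(const char* const* cpulists, int nodes, unsigned long cpu) {
    for (int node = 0; node < nodes; node++) {
        if (cpulist_contains(cpulists[node], cpu)) return node;
    }
    return -1;
}
```

The node index returned here would then be compared with the addr field of each coretemp chip to pick the hwmon group that the CPU's temperature should come from, assuming the node-to-socket correspondence described above holds on the machine in question.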
Hope my suggestions would be adopted!
@BenBE @AnOpenSauceDev I browsed the pull request list and found that she has already done what I want: #1352
I also tested her fork using the same method. As shown below, the problem has been solved. In addition, each core ID is correctly matched with its temperature. I hope this PR will be accepted!
When using htop via SSH on my Ubuntu server, I notice that even if I enable "Also show CPU temperature" (libsensors5 is installed), no temperature reading appears. I'm unsure if this is because of my core count or not (40 threads total), but no matter what I do, nothing will show up alongside the usage reading.