ocerman / zenpower

Zenpower is a Linux kernel driver for reading temperature, voltage (SVI2), current (SVI2) and power (SVI2) for AMD Zen family CPUs.
GNU General Public License v2.0

Multi-CPU support #24

Closed ocerman closed 4 years ago

ocerman commented 4 years ago

I believe @harrykipper's system has 2x EPYC (https://github.com/ocerman/zenpower/issues/12#issuecomment-588335876), but SVI2 sensors are displayed only for the first CPU.

ocerman commented 4 years ago

The latest commit should bring SVI2 values for the second CPU as well. @harrykipper, can you try it?

harrykipper commented 4 years ago

Hi, interestingly sometimes I see SVI2 values for both CPUs, other times just as it was before.

sensors zenpower-*

zenpower-pci-00eb
Adapter: PCI adapter
SVI2_Core: 688.00 mV
Tdie: +29.5°C (high = +95.0°C)
Tctl: +29.5°C
SVI2_P_Core: 714.83 mW
SVI2_C_Core: 1.04 A

zenpower-pci-00db
Adapter: PCI adapter
Tdie: +31.0°C (high = +95.0°C)
Tctl: +31.0°C

zenpower-pci-00fb
Adapter: PCI adapter
Tdie: +28.2°C (high = +95.0°C)
Tctl: +28.2°C

zenpower-pci-00cb
Adapter: PCI adapter
SVI2_Core: 694.00 mV
Tdie: +31.0°C (high = +95.0°C)
Tctl: +31.0°C
SVI2_P_Core: 721.07 mW
SVI2_C_Core: 1.04 A

zenpower-pci-00f3
Adapter: PCI adapter
Tdie: +29.0°C (high = +95.0°C)
Tctl: +29.0°C

zenpower-pci-00e3
Adapter: PCI adapter
SVI2_SoC: 944.00 mV
Tdie: +29.5°C (high = +95.0°C)
Tctl: +29.5°C
Tccd1: +30.0°C
Tccd2: +29.0°C
Tccd3: +28.5°C
SVI2_P_SoC: 22.14 W
SVI2_C_SoC: 23.45 A

zenpower-pci-00d3
Adapter: PCI adapter
Tdie: +31.2°C (high = +95.0°C)
Tctl: +31.2°C

zenpower-pci-00c3
Adapter: PCI adapter
SVI2_SoC: 944.00 mV
Tdie: +31.2°C (high = +95.0°C)
Tctl: +31.2°C
Tccd1: +31.5°C
Tccd2: +31.8°C
Tccd3: +31.2°C
SVI2_P_SoC: 21.11 W
SVI2_C_SoC: 22.37 A

A moment later I have:

sensors zenpower-*

zenpower-pci-00eb
Adapter: PCI adapter
SVI2_Core: 694.00 mV
Tdie: +29.4°C (high = +95.0°C)
Tctl: +29.4°C
SVI2_P_Core: 0.00 W
SVI2_C_Core: 0.00 A

zenpower-pci-00db
Adapter: PCI adapter
Tdie: +31.0°C (high = +95.0°C)
Tctl: +31.0°C

zenpower-pci-00fb
Adapter: PCI adapter
Tdie: +28.0°C (high = +95.0°C)
Tctl: +28.0°C

zenpower-pci-00cb
Adapter: PCI adapter
SVI2_Core: 688.00 mV
Tdie: +30.8°C (high = +95.0°C)
Tctl: +30.8°C
SVI2_P_Core: 714.83 mW
SVI2_C_Core: 1.04 A

zenpower-pci-00f3
Adapter: PCI adapter
Tdie: +28.8°C (high = +95.0°C)
Tctl: +28.8°C

zenpower-pci-00e3
Adapter: PCI adapter
SVI2_SoC: 944.00 mV
Tdie: +29.5°C (high = +95.0°C)
Tctl: +29.5°C
Tccd1: +29.8°C
Tccd2: +29.0°C
Tccd3: +28.2°C
SVI2_P_SoC: 22.14 W
SVI2_C_SoC: 23.45 A

zenpower-pci-00d3
Adapter: PCI adapter
Tdie: +31.2°C (high = +95.0°C)
Tctl: +31.2°C

zenpower-pci-00c3
Adapter: PCI adapter
SVI2_SoC: 944.00 mV
Tdie: +31.2°C (high = +95.0°C)
Tctl: +31.2°C
Tccd1: +31.5°C
Tccd2: +31.2°C
Tccd3: +31.2°C
SVI2_P_SoC: 20.77 W
SVI2_C_SoC: 22.01 A

./zp_read_debug.sh

KERN_SUP: 1
NODE7; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 0000381c
0005996c = 0000003e
00059970 = c0800005
KERN_SUP: 1
NODE0; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 01620040
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000a86
00059958 = 00000a82
0005995c = 00000a80
00059960 = 00000000
00059964 = 08400001
00059968 = 00003e1f
0005996c = 00000040
00059970 = c0800005
KERN_SUP: 1
NODE1; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 018a0001
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00003e1f
0005996c = 00000041
00059970 = c0800005
KERN_SUP: 1
NODE2; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00003e1f
0005996c = 00000043
00059970 = c0800005
KERN_SUP: 1
NODE3; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00003c1e
0005996c = 00000042
00059970 = c0800005
KERN_SUP: 1
NODE4; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 01620042
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000a78
00059958 = 00000a72
0005995c = 00000a70
00059960 = 00000000
00059964 = 08400001
00059968 = 0000361b
0005996c = 0000003f
00059970 = c0800005
KERN_SUP: 1
NODE5; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 018a0000
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00003a1d
0005996c = 00000043
00059970 = c0800005
KERN_SUP: 1
NODE6; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 0000381c
0005996c = 00000042
00059970 = c0800005

ocerman commented 4 years ago

@harrykipper thanks for testing it.

It looks good to me. Core is displayed both times at zenpower-pci-00eb and zenpower-pci-00cb (before, only at zenpower-pci-00cb), and SoC both times at zenpower-pci-00e3 and zenpower-pci-00c3 (before, only at zenpower-pci-00c3).

As for core current/power sometimes being 0: the raw current values do not have high resolution, so they can be rounded to 0 when the current is very low. Also, the representation of raw current values can differ between boards, so I cannot guarantee that current/power readings are always accurate.
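
The resolution effect described above can be illustrated with a small sketch. It rests on assumptions that are not stated in this thread: that bits 16-23 of the raw register hold an SVI2 voltage ID with V = 1550 mV - 6.25 mV x VID, and that bits 0-7 hold a raw current count. The 1039 mA-per-count step for Core is only inferred from the 1.04 A readings posted here, and the function names are hypothetical. The voltage formula does at least agree with the posted data (0x8a gives 688 mV, as shown for SVI2_Core).

```python
# Sketch: decode a raw SVI2 telemetry word from the debug dump.
# ASSUMPTIONS (not confirmed by this thread): the bit layout, and the
# 1039 mA-per-count Core step, which is inferred from the 1.04 A readings.

CORE_MA_PER_COUNT = 1039  # assumed step size for the Core plane


def svi2_voltage_mv(raw: int) -> int:
    """Voltage in mV from bits 16-23: 1550 mV minus 6.25 mV per VID step."""
    vid = (raw >> 16) & 0xFF
    return 1550 - (625 * vid) // 100


def svi2_core_current_ma(raw: int) -> int:
    """Current in mA from bits 0-7, quantized to whole counts."""
    return (raw & 0xFF) * CORE_MA_PER_COUNT


# Two raw values taken from the debug dump in this thread.
for raw in (0x018A0001, 0x018A0000):
    mv = svi2_voltage_mv(raw)
    ma = svi2_core_current_ma(raw)
    print(f"{raw:08x}: {mv} mV, {ma / 1000:.2f} A, {mv * ma / 1e6:.2f} W")
```

Both values decode to 688 mV, but the first carries one current count (1.04 A) and the second carries none (0.00 A, hence 0.00 W). With a step of roughly 1 A, a lightly loaded core plane can easily report zero.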

abucodonosor commented 4 years ago

@ocerman

Here is data from my home server. Board: Supermicro H11DSi-NT, 2 x AMD EPYC 7281 16C

Sensors output:

crazy@ant:~/zenpower$ sensors zenpower-*
zenpower-pci-00f3
Adapter: PCI adapter
Tdie:         +21.5°C  (high = +95.0°C)
Tctl:         +21.5°C  

zenpower-pci-00e3
Adapter: PCI adapter
SVI2_SoC:    944.00 mV 
Tdie:         +23.5°C  (high = +95.0°C)
Tctl:         +23.5°C  
Tccd1:        +23.5°C  
Tccd2:        +21.8°C  
Tccd3:        +22.8°C  
SVI2_P_SoC:   20.77 W  
SVI2_C_SoC:   22.01 A  

zenpower-pci-00d3
Adapter: PCI adapter
Tdie:         +25.6°C  (high = +95.0°C)
Tctl:         +25.6°C  

zenpower-pci-00c3
Adapter: PCI adapter
SVI2_SoC:    944.00 mV 
Tdie:         +29.2°C  (high = +95.0°C)
Tctl:         +29.2°C  
Tccd1:        +27.0°C  
Tccd2:        +26.5°C  
Tccd3:        +26.5°C  
SVI2_P_SoC:   20.43 W  
SVI2_C_SoC:   21.65 A  

zenpower-pci-00fb
Adapter: PCI adapter
Tdie:         +22.8°C  (high = +95.0°C)
Tctl:         +22.8°C  

zenpower-pci-00eb
Adapter: PCI adapter
SVI2_Core:   700.00 mV 
Tdie:         +23.5°C  (high = +95.0°C)
Tctl:         +23.5°C  
SVI2_P_Core:   0.00 W  
SVI2_C_Core:   0.00 A  

zenpower-pci-00db
Adapter: PCI adapter
Tdie:         +26.2°C  (high = +95.0°C)
Tctl:         +26.2°C  

zenpower-pci-00cb
Adapter: PCI adapter
SVI2_Core:   950.00 mV 
Tdie:         +26.5°C  (high = +95.0°C)
Tctl:         +26.5°C  
SVI2_P_Core:   4.94 W  
SVI2_C_Core:   5.20 A  

Debug output:

crazy@ant:~/zenpower$ sh ./zenpower_debug.sh
KERN_SUP: 1
NODE0; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 0161003c
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000a60
00059958 = 00000a56
0005995c = 00000a58
00059960 = 00000000
00059964 = 08400001
00059968 = 00003a1d
0005996c = 0000002c
00059970 = c0800005
KERN_SUP: 1
NODE1; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 01610003
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 0000341a
0005996c = 0000002b
00059970 = c0800005
KERN_SUP: 1
NODE2; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00003219
0005996c = 0000002b
00059970 = c0800005
KERN_SUP: 1
NODE3; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 0000341a
0005996c = 0000002a
00059970 = c0800005
KERN_SUP: 1
NODE4; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 0161003d
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000a46
00059958 = 00000a38
0005995c = 00000a40
00059960 = 00000000
00059964 = 08400001
00059968 = 00002e17
0005996c = 00000026
00059970 = c0800005
KERN_SUP: 1
NODE5; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 01880000
0005a010 = 01f70000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00002e17
0005996c = 00000028
00059970 = c0800005
KERN_SUP: 1
NODE6; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00002814
0005996c = 00000024
00059970 = c0800005
KERN_SUP: 1
NODE7; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 00000000
0005a010 = 00000000
0005a014 = 00000000
000598bc = 0fff00ff
0005994c = 00000000
00059954 = 00000000
00059958 = 00000000
0005995c = 00000000
00059960 = 00000000
00059964 = 08400001
00059968 = 00002c16
0005996c = 00000025
00059970 = c0800005
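
The dump format above (a KERN_SUP line, a NODE header, then `address = value` pairs) is regular enough to parse mechanically when comparing nodes. The following is a minimal sketch, not part of zenpower; `parse_debug` and the regular expressions are hypothetical helpers, and the sample is a short excerpt of the dump above.

```python
# Sketch: parse zenpower debug output into per-node register maps.
# The sample is an excerpt of the dump above; the parser is illustrative.
import re

SAMPLE = """\
KERN_SUP: 1
NODE0; CPU0; N/CPU: 4
0005a008 = 00000002
0005a00c = 0161003c
KERN_SUP: 1
NODE4; CPU1; N/CPU: 4
0005a008 = 00000002
0005a00c = 0161003d
"""

NODE_RE = re.compile(r"NODE(\d+); CPU(\d+); N/CPU: (\d+)")
REG_RE = re.compile(r"([0-9a-f]{8}) = ([0-9a-f]{8})")


def parse_debug(text):
    """Return {node_id: {"cpu": int, "regs": {address: value}}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        node = NODE_RE.fullmatch(line)
        reg = REG_RE.fullmatch(line)
        if node:
            # Start collecting registers for a new node.
            current = nodes.setdefault(
                int(node[1]), {"cpu": int(node[2]), "regs": {}}
            )
        elif reg and current is not None:
            current["regs"][int(reg[1], 16)] = int(reg[2], 16)
    return nodes


nodes = parse_debug(SAMPLE)
print(nodes[4]["cpu"], hex(nodes[4]["regs"][0x5A00C]))
```

Lines that match neither pattern, such as `KERN_SUP: 1`, are skipped, so the same function should also accept the full dump.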

Here is what IPMI sensors report:

crazy@ant:~/zenpower$ sudo ipmi-sensors | grep -v 'N/A'
ID   | Name            | Type              | Reading    | Units | Event
4    | CPU1 Temp       | Temperature       | 29.00      | C     | 'OK'
71   | CPU2 Temp       | Temperature       | 23.00      | C     | 'OK'
138  | System Temp     | Temperature       | 30.00      | C     | 'OK'
205  | Peripheral Temp | Temperature       | 38.00      | C     | 'OK'
272  | MB_10G Temp     | Temperature       | 55.00      | C     | 'OK'
339  | VRMCpu1 Temp    | Temperature       | 35.00      | C     | 'OK'
406  | VRMCpu2 Temp    | Temperature       | 35.00      | C     | 'OK'
473  | VRMSoc1 Temp    | Temperature       | 40.00      | C     | 'OK'
540  | VRMSoc2 Temp    | Temperature       | 45.00      | C     | 'OK'
607  | VRMP1ABCD Temp  | Temperature       | 38.00      | C     | 'OK'
674  | VRMP1EFGH Temp  | Temperature       | 38.00      | C     | 'OK'
741  | VRMP2ABCD Temp  | Temperature       | 38.00      | C     | 'OK'
808  | VRMP2EFGH Temp  | Temperature       | 35.00      | C     | 'OK'
942  | P1-DIMMB1 Temp  | Temperature       | 34.00      | C     | 'OK'
1076 | P1-DIMMD1 Temp  | Temperature       | 36.00      | C     | 'OK'
1478 | P2-DIMMB1 Temp  | Temperature       | 35.00      | C     | 'OK'
1612 | P2-DIMMD1 Temp  | Temperature       | 36.00      | C     | 'OK'
1947 | FAN1            | Fan               | 500.00     | RPM   | 'OK'
2014 | FAN2            | Fan               | 500.00     | RPM   | 'OK'
2215 | FAN5            | Fan               | 600.00     | RPM   | 'OK'
2282 | FAN6            | Fan               | 400.00     | RPM   | 'OK'
2349 | FANA            | Fan               | 1300.00    | RPM   | 'OK'
2416 | FANB            | Fan               | 1300.00    | RPM   | 'OK'
2483 | 12V             | Voltage           | 12.11      | V     | 'OK'
2550 | 5VCC            | Voltage           | 5.00       | V     | 'OK'
2617 | 3.3VCC          | Voltage           | 3.27       | V     | 'OK'
2751 | P1_VDDCR        | Voltage           | 0.95       | V     | 'OK'
2818 | P1_VMEMABCD     | Voltage           | 1.24       | V     | 'OK'
2885 | P2_VDDCR        | Voltage           | 0.72       | V     | 'OK'
2952 | P1_VMEMEFGH     | Voltage           | 1.24       | V     | 'OK'
3019 | VDD_5_DUAL      | Voltage           | 4.89       | V     | 'OK'
3086 | VDD_33_DUAL     | Voltage           | 3.30       | V     | 'OK'
3153 | P2_VMEMABCD     | Voltage           | 1.24       | V     | 'OK'
3220 | P2_VMEMEFGH     | Voltage           | 1.23       | V     | 'OK'
3287 | P1_SOCRUN       | Voltage           | 0.98       | V     | 'OK'
3354 | P2_SOCRUN       | Voltage           | 0.94       | V     | 'OK'
3421 | P1_SOCDUAL      | Voltage           | 0.90       | V     | 'OK'
3488 | P2_SOCDUAL      | Voltage           | 0.90       | V     | 'OK'

If you need any kind of testing, please ping me. I'll take even experimental stuff, debug patches for future development, and whatever else you may need.

Thx for developing this module.

ocerman commented 4 years ago

@abucodonosor thanks for testing. It looks like it is working fine. Both Core and SoC values are present for both CPUs. Both Core/SoC voltages are fine. The current/power is probably not very accurate, but that is a known issue.

I will do another commit with updated sensor labels.

abucodonosor commented 4 years ago

@ocerman

To me, it seems SoC Voltage is read from the second CPU only, which can be checked at least from ipmi sensors.

crazy@ant:~/zenpower$ sensors zenpower-* | grep SVI2_SoC
SVI2_SoC:    938.00 mV 
SVI2_SoC:    938.00 mV 
crazy@ant:~/zenpower$ sudo ipmi-sensors | grep SOCRUN
3287 | P1_SOCRUN       | Voltage           | 0.98       | V     | 'OK'
3354 | P2_SOCRUN       | Voltage           | 0.94       | V     | 'OK'

That is 0.938V which matches P2_SOCRUN Voltage.
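
The comparison above can be done mechanically. The following is a minimal sketch that picks the IPMI rail closest to a zenpower reading; the sample rows are copied from the output above, while `parse_ipmi_voltages` and `closest_rail` are hypothetical helpers and the parsing is deliberately simplified.

```python
# Sketch: find which IPMI rail is closest to a zenpower SVI2_SoC reading.
# The sample rows are copied from the ipmi-sensors output above.

IPMI_LINES = """\
3287 | P1_SOCRUN       | Voltage           | 0.98       | V     | 'OK'
3354 | P2_SOCRUN       | Voltage           | 0.94       | V     | 'OK'
"""


def parse_ipmi_voltages(text):
    """Map sensor name to reading (volts) for each '|'-separated row."""
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 5 and fields[2] == "Voltage":
            readings[fields[1]] = float(fields[3])
    return readings


def closest_rail(svi2_mv, rails):
    """Name of the rail whose voltage is nearest to the SVI2 value."""
    return min(rails, key=lambda name: abs(rails[name] - svi2_mv / 1000))


rails = parse_ipmi_voltages(IPMI_LINES)
print(closest_rail(938.0, rails))  # prints P2_SOCRUN
```

A nearest-match check like this only shows which rail a reading resembles; it cannot prove which CPU the value was actually read from.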

As for power and current, it may also depend on how the BIOS is set up, performance, powersave etc. But that is not much of an issue right now, at least it works.

ocerman commented 4 years ago

@abucodonosor can you try the latest commit? It should add the CPU number to sensor labels when dual CPUs are installed.

As for the SoC: unfortunately I don't know how to fix that.
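
The labelling change mentioned in this comment can be sketched as follows. This is not the driver's code; it only illustrates the idea, and it assumes that node IDs are numbered consecutively per socket, as the debug dumps earlier in this thread suggest (NODE0 to NODE3 report CPU0 and NODE4 to NODE7 report CPU1 with N/CPU: 4). The function names are hypothetical.

```python
# Sketch of the labelling idea only; this is NOT the driver's code.
# ASSUMPTION: node IDs are consecutive per socket, as in the debug dumps
# (NODE0-3 -> CPU0, NODE4-7 -> CPU1 with N/CPU: 4).


def cpu_of_node(node_id: int, nodes_per_cpu: int) -> int:
    """Socket index that a node belongs to."""
    return node_id // nodes_per_cpu


def sensor_label(base: str, node_id: int, nodes_per_cpu: int, cpu_count: int) -> str:
    """Prefix the label with 'cpuN ' only on multi-socket systems."""
    if cpu_count < 2:
        return base
    return f"cpu{cpu_of_node(node_id, nodes_per_cpu)} {base}"


print(sensor_label("SVI2_SoC", 4, 4, 2))  # cpu1 SVI2_SoC
print(sensor_label("SVI2_SoC", 0, 4, 1))  # SVI2_SoC
```

Leaving single-socket labels unchanged keeps existing configurations that match on label names working.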

abucodonosor commented 4 years ago

@ocerman

Already did that, looks good. Here is the output:


crazy@ant:~/zenpower$ sensors zenpower-*               
zenpower-pci-00f3
Adapter: PCI adapter
cpu1 Tdie:    +19.0°C  (high = +95.0°C)
cpu1 Tctl:    +19.0°C  

zenpower-pci-00e3
Adapter: PCI adapter
cpu1 SVI2_SoC:   944.00 mV 
cpu1 Tdie:        +20.5°C  (high = +95.0°C)
cpu1 Tctl:        +20.5°C  
cpu1 Tccd1:       +20.5°C  
cpu1 Tccd2:       +19.0°C  
cpu1 Tccd3:       +19.8°C  
cpu1 SVI2_P_SoC:  20.77 W  
cpu1 SVI2_C_SoC:  22.01 A  

zenpower-pci-00d3
Adapter: PCI adapter
cpu0 Tdie:    +21.6°C  (high = +95.0°C)
cpu0 Tctl:    +21.6°C  

zenpower-pci-00c3
Adapter: PCI adapter
cpu0 SVI2_SoC:   944.00 mV 
cpu0 Tdie:        +23.5°C  (high = +95.0°C)
cpu0 Tctl:        +23.5°C  
cpu0 Tccd1:       +21.5°C  
cpu0 Tccd2:       +21.8°C  
cpu0 Tccd3:       +21.8°C  
cpu0 SVI2_P_SoC:  20.77 W  
cpu0 SVI2_C_SoC:  22.01 A  

zenpower-pci-00fb
Adapter: PCI adapter
cpu1 Tdie:    +19.5°C  (high = +95.0°C)
cpu1 Tctl:    +19.5°C  

zenpower-pci-00eb
Adapter: PCI adapter
cpu1 SVI2_Core:   969.00 mV 
cpu1 Tdie:         +20.2°C  (high = +95.0°C)
cpu1 Tctl:         +20.2°C  
cpu1 SVI2_P_Core:   1.01 W  
cpu1 SVI2_C_Core:   1.04 A  

zenpower-pci-00db
Adapter: PCI adapter
cpu0 Tdie:    +21.2°C  (high = +95.0°C)
cpu0 Tctl:    +21.2°C  

zenpower-pci-00cb
Adapter: PCI adapter
cpu0 SVI2_Core:   957.00 mV 
cpu0 Tdie:         +21.2°C  (high = +95.0°C)
cpu0 Tctl:         +21.2°C  
cpu0 SVI2_P_Core:   3.98 W  
cpu0 SVI2_C_Core:   4.16 A