intel / powertelemetry

Internal sources of Power Telemetry Library. Power Telemetry Library is a golang library that provides power-related CPU info.
Apache License 2.0

RAPL package ID validation is unnecessary and breaks collections #3

Open koallen opened 9 months ago

koallen commented 9 months ago

https://github.com/intel/powertelemetry/blob/5279ae9e8994ee1ecdb98a7ec8e2a0a20bd9e542/rapl.go#L288-L290

This code compares the package ID parsed from the zone path with the one parsed from the name file. However, on machines with the psys RAPL domain, the two may differ depending on how the zones are enumerated: for example, the path may contain intel-rapl:2 while the name is package-1. In that case the comparison returns an error and nothing can be collected for that package.
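A minimal sketch of the failure mode described above. The helpers `zoneIndexFromPath`, `packageIDFromName`, and `validateByIndex` are hypothetical names made up for illustration and are not copied from rapl.go; they only show why requiring the directory index to equal the package number breaks once a non-package zone sits between two packages.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// zoneIndexFromPath extracts N from a top-level zone directory ".../intel-rapl:N".
func zoneIndexFromPath(zonePath string) (int, error) {
	base := filepath.Base(zonePath)
	if !strings.HasPrefix(base, "intel-rapl:") {
		return 0, fmt.Errorf("not a RAPL zone path: %q", zonePath)
	}
	return strconv.Atoi(strings.TrimPrefix(base, "intel-rapl:"))
}

// packageIDFromName extracts N from the contents of a name file, "package-N".
func packageIDFromName(name string) (int, error) {
	name = strings.TrimSpace(name)
	if !strings.HasPrefix(name, "package-") {
		return 0, fmt.Errorf("not a package zone name: %q", name)
	}
	return strconv.Atoi(strings.TrimPrefix(name, "package-"))
}

// validateByIndex (hypothetical) rejects a zone whose directory index
// differs from the package number in its name.
func validateByIndex(zonePath, zoneName string) error {
	idx, err := zoneIndexFromPath(zonePath)
	if err != nil {
		return err
	}
	id, err := packageIDFromName(zoneName)
	if err != nil {
		return err
	}
	if idx != id {
		return fmt.Errorf("package ID mismatch between zone path %q and zone name %q", zonePath, zoneName)
	}
	return nil
}

func main() {
	const root = "/sys/devices/virtual/powercap/intel-rapl/"
	// Index and package number agree: no error.
	fmt.Println(validateByIndex(root+"intel-rapl:0", "package-0"))
	// psys occupies index 1, so package-1 lands at index 2: error.
	fmt.Println(validateByIndex(root+"intel-rapl:2", "package-1"))
}
```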

koallen commented 9 months ago

Example

/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/name:package-0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/name:psys
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/name:package-1

And when using this library, we get the following error

error validating package domain zone: package ID mismatch between zone path "/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2" and zone name "package-1"

p-zak commented 9 months ago

@koallen Thanks for reporting the issue.

Can you provide output of the following command?

find /sys/devices/virtual/powercap/intel-rapl/ -type f -print -exec cat {} \;

It will print the path to each file (inside intel-rapl subsystem) along with its value.

Additionally, could you provide the output of the following commands?

lscpu
uname -a

koallen commented 9 months ago

$ find /sys/devices/virtual/powercap/intel-rapl/ -type f -print -exec cat {} \;
/sys/devices/virtual/powercap/intel-rapl/uevent
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/uevent
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/energy_uj
60293987000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/enabled
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_1_max_power_uw
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_1_max_power_uw': No data available
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/power/runtime_active_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/power/runtime_status
unsupported
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/power/autosuspend_delay_ms
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/power/autosuspend_delay_ms': Input/output error
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/power/runtime_suspended_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/power/control
auto
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_1_time_window_us
976
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_1_power_limit_uw
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_0_time_window_us
976
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_1_name
short_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_0_power_limit_uw
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_0_name
long_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/name
psys
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_0_max_power_uw
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/constraint_0_max_power_uw': No data available
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/max_energy_range_uj
4294967295000000
/sys/devices/virtual/powercap/intel-rapl/enabled
1
/sys/devices/virtual/powercap/intel-rapl/power/runtime_active_time
0
/sys/devices/virtual/powercap/intel-rapl/power/runtime_status
unsupported
/sys/devices/virtual/powercap/intel-rapl/power/autosuspend_delay_ms
cat: /sys/devices/virtual/powercap/intel-rapl/power/autosuspend_delay_ms: Input/output error
/sys/devices/virtual/powercap/intel-rapl/power/runtime_suspended_time
0
/sys/devices/virtual/powercap/intel-rapl/power/control
auto
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/uevent
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/energy_uj
164708768957
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/enabled
1
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_1_max_power_uw
764000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/power/runtime_active_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/power/runtime_status
unsupported
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/power/autosuspend_delay_ms
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/power/autosuspend_delay_ms': Input/output error
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/power/runtime_suspended_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/power/control
auto
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_1_time_window_us
11712
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/uevent
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/energy_uj
4086542912
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/enabled
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/power/runtime_active_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/power/runtime_status
unsupported
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/power/autosuspend_delay_ms
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/power/autosuspend_delay_ms': Input/output error
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/power/runtime_suspended_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/power/control
auto
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/constraint_0_time_window_us
976
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/constraint_0_power_limit_uw
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/constraint_0_name
long_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/name
dram
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/constraint_0_max_power_uw
122000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/intel-rapl:2:0/max_energy_range_uj
65712999613
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_1_power_limit_uw
420000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_0_time_window_us
999424
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_1_name
short_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_0_power_limit_uw
350000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_0_name
long_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/name
package-1
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/constraint_0_max_power_uw
350000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/max_energy_range_uj
262143328850
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/uevent
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/energy_uj
205249585765
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/uevent
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/energy_uj
1998638617
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/enabled
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/power/runtime_active_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/power/runtime_status
unsupported
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/power/autosuspend_delay_ms
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/power/autosuspend_delay_ms': Input/output error
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/power/runtime_suspended_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/power/control
auto
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_time_window_us
976
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_power_limit_uw
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_name
long_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/name
dram
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/constraint_0_max_power_uw
122000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/intel-rapl:0:0/max_energy_range_uj
65712999613
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/enabled
1
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_1_max_power_uw
764000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/power/runtime_active_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/power/runtime_status
unsupported
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/power/autosuspend_delay_ms
cat: '/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/power/autosuspend_delay_ms': Input/output error
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/power/runtime_suspended_time
0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/power/control
auto
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_1_time_window_us
11712
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_1_power_limit_uw
420000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_0_time_window_us
999424
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_1_name
short_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw
350000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_0_name
long_term
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/name
package-0
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/constraint_0_max_power_uw
350000000
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:0/max_energy_range_uj
262143328850

lscpu and uname

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8480+
BIOS Model name:     Intel(R) Xeon(R) Platinum 8480+
Stepping:            8
CPU MHz:             2000.000
BogoMIPS:            4000.00
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            107520K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

$ uname -a
Linux <HOSTNAME> 4.18.0-372.32.1.el8_6.x86_64 #1 SMP Fri Oct 7 12:35:10 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

koallen commented 9 months ago

As you can see,

/sys/devices/virtual/powercap/intel-rapl/intel-rapl:1/name
psys
/sys/devices/virtual/powercap/intel-rapl/intel-rapl:2/name
package-1

I think it's because psys belongs to a "master package" so when enumerating /sys, psys gets enumerated right after the master package (package-0 in this case) and hence it appears before package-1.
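Given that ordering, one possible approach is to treat the directory index as opaque and take the package ID only from the name file. The sketch below is an assumption about how that could look, not the library's actual code: `packageZones` and `exampleLayout` are hypothetical names, and an in-memory `fstest.MapFS` stands in for sysfs with the three top-level zones shown above.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"strconv"
	"strings"
	"testing/fstest"
)

// packageZones maps each package ID to its top-level zone directory, using
// only the contents of the "name" file. Zones whose name is not "package-N"
// (such as psys) are skipped.
func packageZones(fsys fs.FS) (map[int]string, error) {
	zones, err := fs.Glob(fsys, "intel-rapl:*")
	if err != nil {
		return nil, err
	}
	result := make(map[int]string)
	for _, zone := range zones {
		raw, err := fs.ReadFile(fsys, path.Join(zone, "name"))
		if err != nil {
			return nil, fmt.Errorf("reading name of %s: %w", zone, err)
		}
		name := strings.TrimSpace(string(raw))
		if !strings.HasPrefix(name, "package-") {
			continue // not a package zone, e.g. psys
		}
		id, err := strconv.Atoi(strings.TrimPrefix(name, "package-"))
		if err != nil {
			return nil, fmt.Errorf("parsing zone name %q: %w", name, err)
		}
		result[id] = zone
	}
	return result, nil
}

// exampleLayout mimics the top-level zones reported in this issue.
func exampleLayout() fstest.MapFS {
	return fstest.MapFS{
		"intel-rapl:0/name": {Data: []byte("package-0\n")},
		"intel-rapl:1/name": {Data: []byte("psys\n")},
		"intel-rapl:2/name": {Data: []byte("package-1\n")},
	}
}

func main() {
	zones, err := packageZones(exampleLayout())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(zones[0], zones[1]) // prints "intel-rapl:0 intel-rapl:2"
}
```

On a real machine the same function could be pointed at `os.DirFS("/sys/devices/virtual/powercap/intel-rapl")` instead of the in-memory layout.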

p-zak commented 9 months ago

@koallen Thank you very much for all the outputs.

It seems like you're right about the root cause, but I wanted to have a complete picture of the situation in your RAPL subsystem to conduct a thorough analysis.