powerapi-ng / hwpc-sensor

Hardware Performance Counters monitoring agent for containers.
BSD 3-Clause "New" or "Revised" License

Issue: "event 'TSC' is invalid or unsupported by this machine" #27

Closed: PierreRustOrange closed this issue 9 months ago

PierreRustOrange commented 2 years ago

I have an issue on a server where the sensor fails with the following message: Could not get encoding for event 'TSC' : code -4.

Similar issues have already been raised, but this is not the same problem as in #1 or #25: the sensor here is built with the patched version of libpfm4, and I have been using the same sensor container image successfully on other servers.

I suspect it has something to do with the kernel version and/or the CPU generation, but I couldn't find anything obvious in the libpfm4 source.

This is how the sensor is started:

/usr/bin/hwpc-sensor \
     -n "sensor_$NODE_NAME" \
     -f "$REPORT_FREQ" \
     -r socket -U 127.0.0.1 -P 12000 \
     -s "rapl" -o -e "RAPL_ENERGY_PKG" \
     -s "msr"     -e "TSC" -e "APERF" -e "MPERF" \
     -c "core"    -e "CPU_CLK_THREAD_UNHALTED:REF_P" \
                  -e "CPU_CLK_THREAD_UNHALTED:THREAD_P" \
                  -e "LLC_MISSES" \
                  -e "INSTRUCTIONS_RETIRED"

And here is the full output of the sensor:

I: 22-07-26 10:15:23 build: version unknown (rev: unknown)
I: 22-07-26 10:15:23 uname: Linux 3.10.0-693.2.2.rt56.623.el7.x86_64 #1 SMP PREEMPT RT Thu Sep 14 16:53:49 CEST 2017 x86_64
I: 22-07-26 10:15:23 pmu: found ix86arch 'Intel X86 architectural PMU' having 7 events, 7 counters (4 general, 3 fixed)
I: 22-07-26 10:15:23 pmu: found perf 'perf_events generic PMU' having 133 events, 0 counters (0 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found rapl 'Intel RAPL' having 2 events, 3 counters (0 general, 3 fixed)
I: 22-07-26 10:15:23 pmu: found perf_raw 'perf_events raw PMU' having 1 events, 0 counters (0 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx 'Intel Skylake X' having 85 events, 11 counters (8 general, 3 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha0 'Intel SkylakeX CHA0 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha1 'Intel SkylakeX CHA1 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha2 'Intel SkylakeX CHA2 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha3 'Intel SkylakeX CHA3 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha4 'Intel SkylakeX CHA4 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha5 'Intel SkylakeX CHA5 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha6 'Intel SkylakeX CHA6 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha7 'Intel SkylakeX CHA7 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha8 'Intel SkylakeX CHA8 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha9 'Intel SkylakeX CHA9 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha10 'Intel SkylakeX CHA10 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha11 'Intel SkylakeX CHA11 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha12 'Intel SkylakeX CHA12 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha13 'Intel SkylakeX CHA13 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha14 'Intel SkylakeX CHA14 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha15 'Intel SkylakeX CHA15 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha16 'Intel SkylakeX CHA16 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha17 'Intel SkylakeX CHA17 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha18 'Intel SkylakeX CHA18 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha19 'Intel SkylakeX CHA19 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha20 'Intel SkylakeX CHA20 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha21 'Intel SkylakeX CHA21 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha22 'Intel SkylakeX CHA22 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha23 'Intel SkylakeX CHA23 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha24 'Intel SkylakeX CHA24 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha25 'Intel SkylakeX CHA25 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha26 'Intel SkylakeX CHA26 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_cha27 'Intel SkylakeX CHA27 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_iio0 'Intel SkylakeX IIO0 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_iio1 'Intel SkylakeX IIO1 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_iio2 'Intel SkylakeX IIO2 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_iio3 'Intel SkylakeX IIO3 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_iio4 'Intel SkylakeX IIO4 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_imc0 'Intel SkylakeX IMC0 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_imc1 'Intel SkylakeX IMC1 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_imc2 'Intel SkylakeX IMC2 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_imc3 'Intel SkylakeX IMC3 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_imc4 'Intel SkylakeX IMC4 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_imc5 'Intel SkylakeX IMC5 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_m2m0 'Intel SkylakeX M2M0 uncore' having 121 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_m2m1 'Intel SkylakeX M2M1 uncore' having 121 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_m3upi0 'Intel SkylakeX M3UPI0 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_m3upi1 'Intel SkylakeX M3UPI1 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_m3upi2 'Intel SkylakeX M3UPI2 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_pcu 'Intel SkylakeX PCU uncore' having 29 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_ubo 'Intel SkylakeX U-Box uncore' having 5 events, 3 counters (2 general, 1 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_upi0 'Intel SkylakeX UPI0 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_upi1 'Intel SkylakeX UPI1 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-07-26 10:15:23 pmu: found skx_unc_upi2 'Intel SkylakeX UPI2 uncore' having 34 events, 4 counters (4 general, 0 fixed)
W: 22-07-26 10:15:23 Could not get encoding for event 'TSC' : code -4 
E: 22-07-26 10:15:23 config: event 'TSC' is invalid or unsupported by this machine
E: 22-07-26 10:15:23 config: failed to parse the provided command-line arguments

The CPU on this server is a Xeon Gold 6142; here is what lscpu returns:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
Stepping:              4
CPU MHz:               2600.000
BogoMIPS:              5200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

Any idea?

gfieni commented 2 years ago

Hello, this is a very weird problem indeed. I don't remember having a problem with this CPU model on CentOS 7. I see that this machine uses a real-time kernel, and I have never tested the sensor on this kind of kernel. I will run some tests on a machine with a very similar CPU model (Intel Xeon Gold 6130) and get back to you.

PierreRustOrange commented 2 years ago

Thanks for your feedback; it's quite weird indeed. I've ruled out the CPU: we have another server with the exact same CPU where the sensor runs fine. So it's probably an issue with the OS or kernel.

When I look into /sys/devices, I don't even have an msr sub-directory!

gfieni commented 2 years ago

Hello, the problem is that this kernel version does not support the msr perf_event PMU. The libpfm library raises this error because it needs to read the /sys/devices/<pmu>/type and /sys/devices/<pmu>/events/<event> files to set up some PMUs.
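
For anyone who wants to check a machine before starting the sensor, here is a minimal sketch of that sysfs test. The `pmu_available` function is a hypothetical helper written for this thread, not part of hwpc-sensor or libpfm, and it only assumes that a supported PMU exposes a readable `type` file under /sys/devices as described above.

```shell
#!/bin/sh
# Hypothetical helper, not part of hwpc-sensor or libpfm.
# Usage: pmu_available PMU_NAME [SYSFS_ROOT]
# Succeeds when SYSFS_ROOT/PMU_NAME/type exists and is readable.
# SYSFS_ROOT defaults to /sys/devices; the optional argument exists
# only so the check can be pointed at another directory.
pmu_available() {
    pmu="$1"
    root="${2:-/sys/devices}"
    [ -r "$root/$pmu/type" ]
}

if pmu_available msr; then
    echo "msr PMU present (type $(cat /sys/devices/msr/type))"
else
    echo "msr PMU missing: TSC, APERF and MPERF cannot be encoded on this kernel"
fi
```

On a kernel without the msr PMU the second message is printed, which matches the missing /sys/devices/msr directory reported above.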

I tested your kernel-rt version (3.10.0-693.2.2.rt56.623.el7): /sys/devices/msr is also missing there and I get the same error as you. I also tried the next version (3.10.0-693.11.1.rt56.632.el7), but it did not work either. The closest kernel-rt version where /sys/devices/msr is present and the sensor works correctly is 3.10.0-957.1.3.rt56.913.el7. The latest kernel-rt version (3.10.0-1160.42.2.rt56.1182.el7) works too.
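
As a rough way to triage several machines, the releases above can be compared against the oldest one I saw working. This is only a sketch: `kernel_at_least` is a hypothetical helper, it relies on the version ordering of GNU `sort -V`, and the threshold is simply the oldest release tested as working here. Releases between 693.11.1 and 957.1.3 were not tested, so looking at /sys/devices/msr directly remains the reliable check.

```shell
#!/bin/sh
# Hypothetical helper, not part of hwpc-sensor.
# Usage: kernel_at_least RELEASE MINIMUM
# Succeeds when RELEASE sorts at or after MINIMUM in version order.
kernel_at_least() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# Oldest kernel-rt release reported as working in this thread
# (an observed data point, not a documented minimum).
MIN_WORKING="3.10.0-957.1.3.rt56.913.el7"

for release in \
    "3.10.0-693.2.2.rt56.623.el7" \
    "3.10.0-693.11.1.rt56.632.el7" \
    "3.10.0-957.1.3.rt56.913.el7" \
    "3.10.0-1160.42.2.rt56.1182.el7"
do
    if kernel_at_least "$release" "$MIN_WORKING"; then
        echo "$release: msr PMU expected"
    else
        echo "$release: msr PMU likely missing"
    fi
done
```

To test the running kernel instead of the hard-coded list, pass "$(uname -r)" as the first argument.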

Unfortunately, it seems that you cannot fix this problem without upgrading the kernel on this machine.

PierreRustOrange commented 2 years ago

Thanks a lot for this analysis! I'll see if I can upgrade the kernel, but that might be complicated...