powerapi-ng / hwpc-sensor

Hardware Performance Counters monitoring agent for containers.
BSD 3-Clause "New" or "Revised" License

Access to RAPL counters on some CPU / kernel combinations #20

Open PierreRustOrange opened 2 years ago

PierreRustOrange commented 2 years ago

On some systems, the sensor fails to access RAPL counters and we get this error at startup:

E: 21-12-07 11:14:26 config: event 'RAPL_ENERGY_PKG' is invalid or unsupported by this machine

However, on the same systems, we can see RAPL data in the powercap sysfs.

powerapi-ng/powerapi#125 is probably an example of this error.

Currently the sensor uses the perf subsystem to access RAPL, which is implemented in a different part of the kernel source tree than powercap. I therefore suspect that this can happen when the kernel contains, for the machine's CPU, the powercap implementation but not RAPL access in perf.

I suggest we implement a fallback access to RAPL using powercap sysfs, when we cannot use perf.
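As an illustration of what such a fallback could look like, here is a minimal Python sketch that reads the package energy counters from the powercap sysfs. It is not the sensor's implementation. The function names are hypothetical; the `energy_uj` and `max_energy_range_uj` files are the ones exposed by powercap, and the wraparound handling is an approximation that assumes at most one wrap between two readings.

```python
from __future__ import annotations

from pathlib import Path

# Usual location of the powercap sysfs; entries here link to /sys/devices/virtual/powercap.
POWERCAP_ROOT = Path("/sys/class/powercap")


def package_zones(root: Path = POWERCAP_ROOT) -> list[Path]:
    """Top-level intel-rapl zones (e.g. intel-rapl:0), skipping subzones such as intel-rapl:0:0."""
    return sorted(p for p in root.glob("intel-rapl:*") if p.name.count(":") == 1)


def read_energy_uj(zone: Path) -> int:
    """Cumulative energy counter of one zone, in microjoules."""
    return int((zone / "energy_uj").read_text().strip())


def read_max_range_uj(zone: Path) -> int:
    """Range of the energy counter, in microjoules, after which it wraps."""
    return int((zone / "max_energy_range_uj").read_text().strip())


def energy_delta_uj(previous: int, current: int, max_range: int) -> int:
    """Energy consumed between two readings, assuming at most one wraparound."""
    if current >= previous:
        return current - previous
    return current + max_range - previous


def average_power_watts(delta_uj: int, interval_s: float) -> float:
    """Convert an energy delta in microjoules over an interval into average watts."""
    return delta_uj / 1e6 / interval_s
```

A caller would read each zone twice, a known interval apart, and pass both readings to `energy_delta_uj`. On recent kernels, reading `energy_uj` may require root privileges.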

PierreRustOrange commented 2 years ago

It seems this can also happen if the appropriate kernel module is not loaded. On Ubuntu, the module is in the linux-modules-extra package:

apt install linux-modules-extra-$(uname -r)
update-initramfs -c -k $(uname -r)

The module is in /usr/lib/modules/$(uname -r)/kernel/arch/x86/events/ in rapl.ko for recent kernels or intel/intel-rapl-perf.ko for older kernels.

Thanks @gfieni for this info !
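One way to check whether the module is actually loaded is to look for it in `/proc/modules`. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the module shows up as `rapl` or `intel_rapl_perf` (dashes in file names become underscores). A driver built into the kernel rather than loaded as a module will not be listed, so a negative result is not conclusive.

```python
from __future__ import annotations

from pathlib import Path

# Names under which the perf RAPL module is assumed to appear in /proc/modules.
RAPL_PERF_MODULES = {"rapl", "intel_rapl_perf"}


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Module names from the text of /proc/modules (first field of each line)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def rapl_perf_module_loaded(proc_modules_text: str) -> bool:
    """True if one of the assumed RAPL perf module names is currently loaded."""
    return bool(RAPL_PERF_MODULES & loaded_modules(proc_modules_text))


def check_host() -> bool:
    """Run the check against the live /proc/modules of this machine."""
    return rapl_perf_module_loaded(Path("/proc/modules").read_text())
```

Calling `check_host()` on the affected machine gives a quick yes/no before installing linux-modules-extra.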

However, I think there are still cases where the perf implementation is not available in a kernel (for a recent CPU), while powercap is OK. For example, with a 5.4 kernel on an i7-10875H (that's a laptop CPU, but I've seen similar issues with server-class CPUs).

dsaingre commented 2 years ago

Hi, is there any update on the issue? @PierreRustOrange @rouvoy

It seems I can't use the hwpc-sensor. My issue seems to be similar to https://github.com/powerapi-ng/powerapi/issues/125. Even after installing the appropriate kernel module with the commands advised in the previous comment, I still have issues with perf:

sudo perf stat -a -e "power/energy-cores/" /bin/ls
[sudo] password for dimitri: 
event syntax error: 'power/energy-cores/'
                     \___ Cannot find PMU `power'. Missing kernel support?
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events

I have a laptop with a 5.13.0-28-generic kernel and an 11th Gen Intel(R) Core(TM) i7-1165G7 CPU (if any more information would help, don't hesitate to ask).

Would the advised solution be to implement a sensor that accesses RAPL through the powercap sysfs?

gfieni commented 2 years ago

Hello @dsaingre, which Linux distribution are you using? Are the powercap energy readings available on your system? Could you give the output of the modinfo rapl command?

dsaingre commented 2 years ago

Hi @gfieni, I'm using Ubuntu 20.04.3 LTS. I believe I do have the energy readings available:

tree /sys/devices/virtual/powercap
/sys/devices/virtual/powercap
├── dtpm
│   ├── enabled
│   ├── power
│   │   ├── async
│   │   ├── autosuspend_delay_ms
│   │   ├── control
│   │   ├── runtime_active_kids
│   │   ├── runtime_active_time
│   │   ├── runtime_enabled
│   │   ├── runtime_status
│   │   ├── runtime_suspended_time
│   │   └── runtime_usage
│   ├── subsystem -> ../../../../class/powercap
│   └── uevent
├── intel-rapl
│   ├── enabled
│   ├── intel-rapl:0
│   │   ├── constraint_0_max_power_uw
│   │   ├── constraint_0_name
│   │   ├── constraint_0_power_limit_uw
│   │   ├── constraint_0_time_window_us
│   │   ├── constraint_1_max_power_uw
│   │   ├── constraint_1_name
│   │   ├── constraint_1_power_limit_uw
│   │   ├── constraint_1_time_window_us
│   │   ├── constraint_2_max_power_uw
│   │   ├── constraint_2_name
│   │   ├── constraint_2_power_limit_uw
│   │   ├── constraint_2_time_window_us
│   │   ├── device -> ../../intel-rapl
│   │   ├── enabled
│   │   ├── energy_uj
│   │   ├── intel-rapl:0:0
│   │   │   ├── constraint_0_max_power_uw
│   │   │   ├── constraint_0_name
│   │   │   ├── constraint_0_power_limit_uw
│   │   │   ├── constraint_0_time_window_us
│   │   │   ├── device -> ../../intel-rapl:0
│   │   │   ├── enabled
│   │   │   ├── energy_uj
│   │   │   ├── max_energy_range_uj
│   │   │   ├── name
│   │   │   ├── power
│   │   │   │   ├── async
│   │   │   │   ├── autosuspend_delay_ms
│   │   │   │   ├── control
│   │   │   │   ├── runtime_active_kids
│   │   │   │   ├── runtime_active_time
│   │   │   │   ├── runtime_enabled
│   │   │   │   ├── runtime_status
│   │   │   │   ├── runtime_suspended_time
│   │   │   │   └── runtime_usage
│   │   │   ├── subsystem -> ../../../../../../class/powercap
│   │   │   └── uevent
│   │   ├── intel-rapl:0:1
│   │   │   ├── constraint_0_max_power_uw
│   │   │   ├── constraint_0_name
│   │   │   ├── constraint_0_power_limit_uw
│   │   │   ├── constraint_0_time_window_us
│   │   │   ├── device -> ../../intel-rapl:0
│   │   │   ├── enabled
│   │   │   ├── energy_uj
│   │   │   ├── max_energy_range_uj
│   │   │   ├── name
│   │   │   ├── power
│   │   │   │   ├── async
│   │   │   │   ├── autosuspend_delay_ms
│   │   │   │   ├── control
│   │   │   │   ├── runtime_active_kids
│   │   │   │   ├── runtime_active_time
│   │   │   │   ├── runtime_enabled
│   │   │   │   ├── runtime_status
│   │   │   │   ├── runtime_suspended_time
│   │   │   │   └── runtime_usage
│   │   │   ├── subsystem -> ../../../../../../class/powercap
│   │   │   └── uevent
│   │   ├── max_energy_range_uj
│   │   ├── name
│   │   ├── power
│   │   │   ├── async
│   │   │   ├── autosuspend_delay_ms
│   │   │   ├── control
│   │   │   ├── runtime_active_kids
│   │   │   ├── runtime_active_time
│   │   │   ├── runtime_enabled
│   │   │   ├── runtime_status
│   │   │   ├── runtime_suspended_time
│   │   │   └── runtime_usage
│   │   ├── subsystem -> ../../../../../class/powercap
│   │   └── uevent
│   ├── intel-rapl:1
│   │   ├── constraint_0_max_power_uw
│   │   ├── constraint_0_name
│   │   ├── constraint_0_power_limit_uw
│   │   ├── constraint_0_time_window_us
│   │   ├── constraint_1_max_power_uw
│   │   ├── constraint_1_name
│   │   ├── constraint_1_power_limit_uw
│   │   ├── constraint_1_time_window_us
│   │   ├── device -> ../../intel-rapl
│   │   ├── enabled
│   │   ├── energy_uj
│   │   ├── max_energy_range_uj
│   │   ├── name
│   │   ├── power
│   │   │   ├── async
│   │   │   ├── autosuspend_delay_ms
│   │   │   ├── control
│   │   │   ├── runtime_active_kids
│   │   │   ├── runtime_active_time
│   │   │   ├── runtime_enabled
│   │   │   ├── runtime_status
│   │   │   ├── runtime_suspended_time
│   │   │   └── runtime_usage
│   │   ├── subsystem -> ../../../../../class/powercap
│   │   └── uevent
│   ├── power
│   │   ├── async
│   │   ├── autosuspend_delay_ms
│   │   ├── control
│   │   ├── runtime_active_kids
│   │   ├── runtime_active_time
│   │   ├── runtime_enabled
│   │   ├── runtime_status
│   │   ├── runtime_suspended_time
│   │   └── runtime_usage
│   ├── subsystem -> ../../../../class/powercap
│   └── uevent
└── intel-rapl-mmio
    ├── enabled
    ├── intel-rapl-mmio:0
    │   ├── constraint_0_max_power_uw
    │   ├── constraint_0_name
    │   ├── constraint_0_power_limit_uw
    │   ├── constraint_0_time_window_us
    │   ├── constraint_1_max_power_uw
    │   ├── constraint_1_name
    │   ├── constraint_1_power_limit_uw
    │   ├── constraint_1_time_window_us
    │   ├── device -> ../../intel-rapl-mmio
    │   ├── enabled
    │   ├── energy_uj
    │   ├── max_energy_range_uj
    │   ├── name
    │   ├── power
    │   │   ├── async
    │   │   ├── autosuspend_delay_ms
    │   │   ├── control
    │   │   ├── runtime_active_kids
    │   │   ├── runtime_active_time
    │   │   ├── runtime_enabled
    │   │   ├── runtime_status
    │   │   ├── runtime_suspended_time
    │   │   └── runtime_usage
    │   ├── subsystem -> ../../../../../class/powercap
    │   └── uevent
    ├── power
    │   ├── async
    │   ├── autosuspend_delay_ms
    │   ├── control
    │   ├── runtime_active_kids
    │   ├── runtime_active_time
    │   ├── runtime_enabled
    │   ├── runtime_status
    │   ├── runtime_suspended_time
    │   └── runtime_usage
    ├── subsystem -> ../../../../class/powercap
    └── uevent

(Is this relevant, and is it what you were asking for? I'm not very knowledgeable yet about powercap and related tools.)

Regarding modinfo rapl:

filename:       /lib/modules/5.13.0-28-generic/kernel/arch/x86/events/rapl.ko
license:        GPL
srcversion:     E0C3F70A00E2957694E4176
alias:          cpu:type:x86,ven0002fam0019mod*:feature:*
alias:          cpu:type:x86,ven0009fam0018mod*:feature:*
alias:          cpu:type:x86,ven0002fam0017mod*:feature:*
alias:          cpu:type:x86,ven0000fam0006mod008F:feature:*
alias:          cpu:type:x86,ven0000fam0006mod009A:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0097:feature:*
alias:          cpu:type:x86,ven0000fam0006mod00A5:feature:*
alias:          cpu:type:x86,ven0000fam0006mod00A6:feature:*
alias:          cpu:type:x86,ven0000fam0006mod006A:feature:*
alias:          cpu:type:x86,ven0000fam0006mod006C:feature:*
alias:          cpu:type:x86,ven0000fam0006mod007D:feature:*
alias:          cpu:type:x86,ven0000fam0006mod007E:feature:*
alias:          cpu:type:x86,ven0000fam0006mod007A:feature:*
alias:          cpu:type:x86,ven0000fam0006mod005F:feature:*
alias:          cpu:type:x86,ven0000fam0006mod005C:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0066:feature:*
alias:          cpu:type:x86,ven0000fam0006mod009E:feature:*
alias:          cpu:type:x86,ven0000fam0006mod008E:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0055:feature:*
alias:          cpu:type:x86,ven0000fam0006mod005E:feature:*
alias:          cpu:type:x86,ven0000fam0006mod004E:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0085:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0057:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0056:feature:*
alias:          cpu:type:x86,ven0000fam0006mod004F:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0047:feature:*
alias:          cpu:type:x86,ven0000fam0006mod003D:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0046:feature:*
alias:          cpu:type:x86,ven0000fam0006mod0045:feature:*
alias:          cpu:type:x86,ven0000fam0006mod003F:feature:*
alias:          cpu:type:x86,ven0000fam0006mod003C:feature:*
alias:          cpu:type:x86,ven0000fam0006mod003E:feature:*
alias:          cpu:type:x86,ven0000fam0006mod003A:feature:*
alias:          cpu:type:x86,ven0000fam0006mod002D:feature:*
alias:          cpu:type:x86,ven0000fam0006mod002A:feature:*
depends:        
retpoline:      Y
intree:         Y
name:           rapl
vermagic:       5.13.0-28-generic SMP mod_unload modversions 
sig_id:         PKCS#7
signer:         Build time autogenerated kernel key
sig_key:        65:04:EF:DB:22:8E:60:98:46:12:AA:25:C3:1D:F0:FA:DE:9C:5F:68
sig_hashalgo:   sha512
signature:      43:C4:06:AF:9D:08:1D:3F:0F:6F:56:DD:20:BE:72:23:5D:D2:2E:98:
        06:D6:7F:59:A4:33:5A:07:2F:A3:73:6A:BB:D7:F9:67:60:87:82:75:
        92:A1:B0:41:DC:37:D5:BA:B7:A9:44:50:E1:26:47:B8:CA:65:3D:49:
        97:62:2A:32:13:4B:22:F2:28:A5:16:19:3D:E6:CD:6D:E1:06:DE:96:
        07:A1:FD:37:F9:9F:B3:48:D9:CA:30:40:14:4D:28:D0:E9:56:1C:4A:
        1E:02:58:74:76:07:A0:D4:3F:6D:A5:2C:71:19:D4:C1:0A:8B:60:AD:
        EB:E5:66:14:43:28:7A:B0:F0:62:E9:93:5B:D9:7D:F7:DE:F0:A5:DA:
        7E:F4:07:4C:55:33:1C:E2:C8:62:3E:4C:05:62:CF:E7:CD:43:81:15:
        87:27:4B:89:BA:C2:AD:07:AB:43:BA:65:F7:1C:61:9E:C6:B6:56:3D:
        3C:CC:CC:ED:61:FE:71:2E:B1:45:4D:FD:98:3E:C3:4A:75:9E:7F:D9:
        D8:1F:80:23:FD:C2:20:00:3B:C6:20:41:8D:89:A5:45:C5:AF:EC:63:
        EB:C9:06:D4:E2:EE:6D:70:2B:50:CA:CF:03:C5:58:07:A8:AD:F9:5F:
        6B:80:CD:90:E8:EF:BD:10:C0:1F:9D:8F:48:A6:F8:52:7B:F5:0B:CB:
        D9:8D:0D:B8:1D:17:40:52:AE:DA:90:85:92:F5:2A:65:5E:89:29:F7:
        FC:E1:55:E6:88:18:02:89:6A:AA:A2:E1:34:7E:DA:96:50:F4:B1:04:
        FE:8E:A1:B2:99:54:20:80:5A:AB:89:AD:A0:77:C6:2F:6F:6B:16:3F:
        5D:01:1A:2B:C1:A9:36:3C:13:CA:60:50:48:0E:D7:ED:1D:4A:F3:2F:
        65:BD:7C:2D:47:B8:65:EE:3A:54:08:8A:49:5D:EA:78:59:DA:05:F5:
        49:C6:A1:F3:ED:B6:F3:65:A0:0B:31:E3:9E:BF:F1:E6:9B:F0:9F:75:
        D6:9E:37:DC:61:A8:E9:84:DD:23:FC:BC:E2:42:00:D6:65:A7:6A:18:
        BF:8C:67:02:D5:9C:04:15:03:AE:13:47:47:8B:AC:AF:F4:4C:BA:EB:
        A9:AC:2E:99:32:A6:A7:29:E7:10:0A:E0:E6:F3:A1:6B:9B:C8:D7:4B:
        43:B6:A5:C7:DF:7E:FA:3D:11:26:F8:F7:E4:F4:E9:AA:14:D3:64:43:
        4C:CB:9A:DE:09:8B:2B:0D:E7:8A:78:7D:8D:59:F9:42:19:49:2C:14:
        CF:30:91:B1:BA:07:36:3D:26:57:7A:6C:2E:F4:C3:61:80:14:02:BD:
        DE:16:EB:05:A8:C8:5A:75:06:FC:FF:84

Does this help to see whether the issue is coming from my side?
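The `alias` lines in the modinfo output list the CPU vendor/family/model combinations for which the module is loaded automatically. A small Python sketch can check whether a given CPU is covered. The function names are hypothetical, and the model number used in the usage note (0x8C for the i7-1165G7) is an assumption that should be confirmed against the `cpu family` and `model` fields of `/proc/cpuinfo`, which are printed in decimal.

```python
from fnmatch import fnmatchcase


def cpu_modalias(vendor: int, family: int, model: int) -> str:
    """Build an x86 CPU modalias prefix in the same form as the alias lines."""
    return f"cpu:type:x86,ven{vendor:04X}fam{family:04X}mod{model:04X}:feature:"


def module_supports_cpu(aliases, vendor: int, family: int, model: int) -> bool:
    """True if any alias pattern (with '*' wildcards) matches the CPU's modalias."""
    modalias = cpu_modalias(vendor, family, model)
    return any(fnmatchcase(modalias, pattern) for pattern in aliases)


# A few of the aliases from the modinfo output, for illustration.
ALIASES = [
    "cpu:type:x86,ven0002fam0019mod*:feature:*",
    "cpu:type:x86,ven0000fam0006mod009A:feature:*",
    "cpu:type:x86,ven0000fam0006mod008E:feature:*",
]
```

For example, `module_supports_cpu(ALIASES, 0, 6, 0x8E)` matches the `mod008E` entry, while `module_supports_cpu(ALIASES, 0, 6, 0x8C)` finds no match. No `mod008C` entry appears anywhere in the full alias list, which would be consistent with the missing `power` PMU if 0x8C is indeed the model of this CPU.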

PierreRustOrange commented 2 years ago

I think that's another case where RAPL support is implemented in powercap (and thus available through the sysfs) but not in perf.

If I'm understanding that code correctly (clearly no warranty here!! :), support for RAPL on this CPU is not even implemented in perf in the current source tree: https://github.com/torvalds/linux/blob/555f3d7be91a873114c9656069f1a9fa476ec41a/arch/x86/events/rapl.c#L776

Meanwhile, it was implemented in powercap two years ago: https://github.com/torvalds/linux/blob/0917b95079af82c69d8f5bab301faeebcd2cb3cd/arch/x86/events/msr.c#L89

I think we still need an option for the sensor to read the RAPL information through the powercap sysfs.

Mbenni commented 2 years ago

Hi, is there any update regarding this issue? @PierreRustOrange @rouvoy I've tried everything that was already suggested, but I still can't use the hwpc sensor. I am using Ubuntu 22.04 LTS and Linux kernel 5.15.0-30-generic. When I try to start the sensor, it seems it can't access the RAPL_ENERGY_PKG event:

$ docker run --rm --net=host --privileged --pid=host -v /sys:/sys -v /var/lib/docker/containers:/var/lib/docker/containers:ro -v /tmp/powerapi-sensor-reporting:/reporting -v $(pwd):/srv -v $(pwd)/config_file.json:/config_file.json powerapi/hwpc-sensor --config-file srv/config_sensor.json

I: 22-05-24 14:00:50 build: version v1.1.2 (rev: eba2fe195878bae1afadb29fb6da7c4151c890ad) (Jan 21 2022 - 14:54:06)
I: 22-05-24 14:00:50 uname: Linux 5.15.0-30-generic #31-Ubuntu SMP Thu May 5 10:00:34 UTC 2022 x86_64
E: 22-05-24 14:00:50 config: event 'RAPL_ENERGY_PKG' is invalid or unsupported by this machine
E: 22-05-24 14:00:50 config: failed to parse the provided config  file

I also get an error with perf:

$ sudo perf stat -a -e "power/energy-cores/" /bin/ls
[sudo] password for mbennani: 
event syntax error: 'power/energy-cores/'
                     \___ Cannot find PMU `power'. Missing kernel support?
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events

PierreRustOrange commented 2 years ago

Hi, could you please tell us which CPU model you are using?

Mbenni commented 2 years ago

Hi, sorry, I forgot to mention that I am using an 11th Gen Intel(R) Core(TM) i7-11390H @ 3.40GHz.

BZConserto commented 2 years ago

Hi @PierreRustOrange, I have the same problem as @Mbenni. I've tried everything that was already suggested, but I still can't use the hwpc sensor. I am using Ubuntu 20.04.4 LTS with Linux kernel 5.13.0-41-generic x86_64, and my CPU is an 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz × 8. Thank you in advance for your answer.

roda82 commented 2 years ago

Hello everyone,

We investigated this issue, and it is clear that the Linux kernel packaged with Ubuntu does not support access to energy events through the perf interface, at least for the "Tiger Lake" and "Rocket Lake" Intel families. To deploy hwpc_sensor on these families, the current solution requires modifying the kernel (cf. arch/x86/events/rapl.c) and recompiling it. If you cannot do that, the best we can do for now is to build a list of supported families with your help. To check whether you can access energy events on your host machine, run the command perf list | grep power/ and check that the output is not empty.
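The same check can be scripted without the perf tool, since the kernel exposes each perf PMU under `/sys/bus/event_source/devices`. The following Python sketch lists the events of the `power` PMU, which should give roughly the same information as `perf list | grep power/`. The function name is hypothetical; an empty result means the `power` PMU is not registered on the running kernel.

```python
from __future__ import annotations

from pathlib import Path

# Each PMU known to perf has a directory here; RAPL events live under "power".
EVENT_SOURCES = Path("/sys/bus/event_source/devices")


def rapl_perf_events(event_sources: Path = EVENT_SOURCES) -> list[str]:
    """Names of the RAPL events exposed through perf, or [] if the PMU is missing."""
    events_dir = event_sources / "power" / "events"
    if not events_dir.is_dir():
        return []
    # Skip the companion "<event>.scale" and "<event>.unit" files.
    return sorted(p.name for p in events_dir.iterdir() if "." not in p.name)
```

On a supported machine, the list would be expected to contain names such as `energy-pkg`. On the systems reported in this thread, it would be empty.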

BZConserto commented 2 years ago

Hi, thank you for your response. For me, the output is empty. Thanks again

dsaingre commented 2 years ago

The output is empty on my side too.

roda82 commented 2 years ago

Hello, in that case you have to modify your kernel if you want to use hwpc_sensor.

BZConserto commented 2 years ago

Hello, thank you once again for your answer! I modified the kernel, and now it works. I have a small question :). Are the measurements from smartwatts in watts or milliwatts? I am seeing odd values on Grafana, on the order of 1000000. Thanks,

Mbenni commented 2 years ago

Hi @BZConserto, could you please tell us what you modified in the kernel? I am only a student, but this would help me a lot in my research. Thank you.

BZConserto commented 2 years ago

Hi @Mbenni, I only changed the Linux kernel version. I previously had 5.13.0-41, and I have now installed 5.10.0-14. I hope this helps.

roda82 commented 2 years ago

Hello, measurements are in watts.

Laccio commented 10 months ago

Hello everyone, I am on an i5-13600K with Ubuntu 22.04.3 LTS and kernel 5.10.0-051000-generic. Running:

sudo docker run --rm \
--net=host \
--privileged \
--pid=host \
-v /sys:/sys \
-v /var/lib/docker/containers:/var/lib/docker/containers:ro \
-v /tmp/powerapi-sensor-reporting:/reporting \
-v $(pwd):/srv \
powerapi/hwpc-sensor \
-n "$(hostname -f)" \
-r "mongodb" -U "mongodb://127.0.0.1" -D "test" -C "prep" \
-s "rapl" -o -e "RAPL_ENERGY_PKG" \
-s "msr" -e "TSC" -e "APERF" -e "MPERF" \
-c "core" -e "CPU_CLK_UNHALTED:REF_P" -e "CPU_CLK_UNHALTED:THREAD_P" -e "LLC_MISSES" -e "INSTRUCTIONS_RETIRED"

I get this output:

I: 23-11-18 22:41:43 build: version unknown (rev: unknown)
I: 23-11-18 22:41:43 uname: Linux 5.10.0-051000-generic #202012132330 SMP Sun Dec 13 23:33:36 UTC 2020 x86_64
I: 23-11-18 22:41:43 pmu: found ix86arch 'Intel X86 architectural PMU' having 7 events, 9 counters (6 general, 3 fixed)
I: 23-11-18 22:41:43 pmu: found perf 'perf_events generic PMU' having 184 events, 0 counters (0 general, 0 fixed)
I: 23-11-18 22:41:43 pmu: found perf_raw 'perf_events raw PMU' having 1 events, 0 counters (0 general, 0 fixed)
I: 23-11-18 22:41:43 pmu: found intel_msr 'Intel MSR' having 6 events, 6 counters (0 general, 6 fixed)
E: 23-11-18 22:41:43 config: event 'RAPL_ENERGY_PKG' is invalid or unsupported by this machine
E: 23-11-18 22:41:43 config: failed to parse the provided command-line arguments

What do you suggest I do? I have already downgraded the kernel to 5.10 as suggested above, but it is still not working. I need RAPL energy readings for my studies.

roda82 commented 10 months ago

Hi, unfortunately the Linux kernel does not currently support access to energy events for your "Raptor Lake" Intel processor. We are working on a new Formula based on procfs that will allow PowerAPI to be used with this kind of processor.