powerapi-ng / hwpc-sensor

Hardware Performance Counters monitoring agent for containers.
BSD 3-Clause "New" or "Revised" License

Sensor fails to detect containers #7

Open PierreRust opened 3 years ago

PierreRust commented 3 years ago

Hi,

I have a system where the sensor fails to detect any running container, whether I use Docker or Kubernetes directly.

When looking at the source code, I see that the sensor

On my system,

Maybe we could bypass type detection and file validation altogether, as I proposed in the old PR #2? That should solve this issue.

Finally, I have no idea why /sys/fs/cgroup/perf_event is different on this system. It's CentOS 7.2; I usually use Debian and Ubuntu systems and have never had this issue before.

Please let me know if I can add something useful or if you have any idea to get me started on this issue.

gfieni commented 3 years ago

Hello, could you give a sample of the output of find /sys/fs/cgroup/perf_event/kubepods.slice/ and find /sys/fs/cgroup/perf_event/system.slice/, please?

PierreRust commented 3 years ago

Sure:

find /sys/fs/cgroup/perf_event/system.slice/ > system_find.txt
find /sys/fs/cgroup/perf_event/kubepods.slice/ > kube_find.txt
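
For readers who want a similar listing without shelling out, here is a minimal Python sketch that walks a cgroup v1 perf_event hierarchy and prints every cgroup directory it contains. It is only an illustration of the kind of listing requested above, not how the sensor itself discovers containers. The function name list_cgroups and its default path are assumptions made for this example.

```python
import os


def list_cgroups(perf_event_root="/sys/fs/cgroup/perf_event"):
    """Return every cgroup directory below perf_event_root, as sorted
    paths relative to that root (for example "/system.slice")."""
    found = []
    # os.walk silently yields nothing if the root does not exist.
    for dirpath, _dirnames, _filenames in os.walk(perf_event_root):
        rel = os.path.relpath(dirpath, perf_event_root)
        if rel != ".":  # skip the hierarchy root itself
            found.append("/" + rel)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical usage: print one cgroup path per line.
    for path in list_cgroups():
        print(path)
```

An empty result means either that the hierarchy is not mounted at that path or that no child cgroup exists there.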

PierreRustOrange commented 2 years ago

Hello, I still have this issue, and I use a workaround where I

I think the first step could be activated with a CLI flag, for when the type does not matter, but I'm not sure how to properly handle the second point. Any ideas? Should we allow overriding them with a CLI option?

I'd like to make a PR for this, to avoid maintaining a fork for our CentOS platform. Besides, I suspect we might encounter other systems where these paths have other values.

PierreRustOrange commented 2 years ago

I almost forgot: I still have a pending PR for the first point, #5.

marceloamaral commented 2 years ago

I'm having a problem related to this when deploying in Kubernetes, but in my case the /sys/fs/cgroup/perf_event folder doesn't exist.

Any clues on how to fix this?

Btw, my kernel has CONFIG_CGROUP_PERF=yes.

gfieni commented 2 years ago

Hello @marceloamaral, this is probably because your distribution uses the unified cgroup v2 hierarchy. Unfortunately, support for the unified cgroup hierarchy is still a work in progress.

For now, the only way to get the sensor to work in this kind of environment is to disable the unified hierarchy and to set up the Kubernetes cluster to use cgroupfs as the cgroup driver. (more information)
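
A quick way to check which layout a node exposes is to look for the cgroup.controllers file, which exists at the root of a unified (v2) mount, and for the perf_event directory used by cgroup v1. The Python sketch below does just that. The function name cgroup_layout and the returned labels are assumptions made for illustration, and the sketch does not reflect how the sensor performs its own detection.

```python
import os


def cgroup_layout(cgroup_root="/sys/fs/cgroup"):
    """Classify the cgroup mount at cgroup_root.

    Returns "v2-unified" if the root looks like a unified hierarchy,
    "v1-perf_event" if a perf_event controller directory is present,
    and "unknown" otherwise.
    """
    # A cgroup v2 mount exposes cgroup.controllers at its root.
    if os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers")):
        return "v2-unified"
    # With cgroup v1, each controller has its own directory.
    if os.path.isdir(os.path.join(cgroup_root, "perf_event")):
        return "v1-perf_event"
    return "unknown"


if __name__ == "__main__":
    print(cgroup_layout())
```

If this reports "v2-unified", the missing /sys/fs/cgroup/perf_event folder is expected, and the workaround described above applies.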

marceloamaral commented 2 years ago

Thanks, I'll check it out!

marceloamaral commented 2 years ago

Hi, I deployed hwpc-sensor in the cluster with cgroup v1.

The sensor finds the VMs running on the system but not the containers. I don't see any errors in the logs, yet no container is detected.

kubectl logs hwpc-sensor-exporter-fb4cs
I: 22-06-29 11:14:22 build: version v1.1.2 (rev: eba2fe195878bae1afadb29fb6da7c4151c890ad) (Jan 21 2022 - 14:54:06)
I: 22-06-29 11:14:22 uname: Linux 5.4.0-66-generic #74~18.04.2-Ubuntu SMP Fri Feb 5 11:17:31 UTC 2021 x86_64
I: 22-06-29 11:14:22 pmu: found ix86arch 'Intel X86 architectural PMU' having 7 events, 7 counters (4 general, 3 fixed)
I: 22-06-29 11:14:22 pmu: found perf 'perf_events generic PMU' having 181 events, 0 counters (0 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found rapl 'Intel RAPL' having 2 events, 3 counters (0 general, 3 fixed)
I: 22-06-29 11:14:22 pmu: found perf_raw 'perf_events raw PMU' having 1 events, 0 counters (0 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha0 'Intel SkylakeX CHA0 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha1 'Intel SkylakeX CHA1 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha2 'Intel SkylakeX CHA2 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha3 'Intel SkylakeX CHA3 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha4 'Intel SkylakeX CHA4 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha5 'Intel SkylakeX CHA5 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha6 'Intel SkylakeX CHA6 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha7 'Intel SkylakeX CHA7 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha8 'Intel SkylakeX CHA8 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha9 'Intel SkylakeX CHA9 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha10 'Intel SkylakeX CHA10 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha11 'Intel SkylakeX CHA11 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha12 'Intel SkylakeX CHA12 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha13 'Intel SkylakeX CHA13 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha14 'Intel SkylakeX CHA14 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha15 'Intel SkylakeX CHA15 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha16 'Intel SkylakeX CHA16 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha17 'Intel SkylakeX CHA17 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha18 'Intel SkylakeX CHA18 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_cha19 'Intel SkylakeX CHA19 uncore' having 99 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio0 'Intel SkylakeX IIO0 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio1 'Intel SkylakeX IIO1 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio2 'Intel SkylakeX IIO2 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio3 'Intel SkylakeX IIO3 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio4 'Intel SkylakeX IIO4 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_iio5 'Intel SkylakeX IIO5 uncore' having 16 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc0 'Intel SkylakeX IMC0 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc1 'Intel SkylakeX IMC1 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc2 'Intel SkylakeX IMC2 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc3 'Intel SkylakeX IMC3 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc4 'Intel SkylakeX IMC4 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_imc5 'Intel SkylakeX IMC5 uncore' having 46 events, 5 counters (4 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m2m0 'Intel SkylakeX M2M0 uncore' having 121 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m2m1 'Intel SkylakeX M2M1 uncore' having 121 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m3upi0 'Intel SkylakeX M3UPI0 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m3upi1 'Intel SkylakeX M3UPI1 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_m3upi2 'Intel SkylakeX M3UPI2 uncore' having 111 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_pcu 'Intel SkylakeX PCU uncore' having 29 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_ubo 'Intel SkylakeX U-Box uncore' having 5 events, 3 counters (2 general, 1 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_upi0 'Intel SkylakeX UPI0 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_upi1 'Intel SkylakeX UPI1 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found skx_unc_upi2 'Intel SkylakeX UPI2 uncore' having 34 events, 4 counters (4 general, 0 fixed)
I: 22-06-29 11:14:22 pmu: found clx 'Intel CascadeLake X' having 85 events, 11 counters (8 general, 3 fixed)
I: 22-06-29 11:14:22 pmu: found intel_msr 'Intel MSR' having 6 events, 6 counters (0 general, 6 fixed)
I: 22-06-29 11:14:22 sensor: configuration is valid, starting monitoring...
I: 22-06-29 11:14:23 perf<all>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-63-guestvm-clone.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf<system>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-66-guestvm2x8-clone.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-64-guestvm-clone1.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-62-guestvm.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-67-guestvm2x8-clone1.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-65-guestvm2x8.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-69-guestvm8x32-clone.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-68-guestvm8x32.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf<system>: monitoring actor started

Any idea how to debug/fix that?
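
As a first debugging step, it can help to reduce the log to the list of targets the sensor actually started monitoring. The Python sketch below extracts the name inside perf<...> from each "monitoring actor started" line, using the line format visible in the log above. The function name monitored_targets and the SAMPLE excerpt are assumptions made for this example.

```python
import re

# Matches lines such as: "I: ... perf<system>: monitoring actor started"
TARGET_RE = re.compile(r"perf<([^>]*)>: monitoring actor started")


def monitored_targets(log_text):
    """Return the distinct target names found in log_text, in order of
    first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = TARGET_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Short excerpt copied from the log above, for illustration only.
SAMPLE = """\
I: 22-06-29 11:14:23 perf<all>: monitoring actor started
I: 22-06-29 11:14:23 perf</machine/qemu-62-guestvm.libvirt-qemu>: monitoring actor started
I: 22-06-29 11:14:23 perf<system>: monitoring actor started
"""

if __name__ == "__main__":
    for target in monitored_targets(SAMPLE):
        print(target)
```

In the log above, every detected target is either all, system, or a /machine/... libvirt cgroup, so no cgroup belonging to a container was picked up.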