sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0

Kepler not reporting correct process name in metrics #1354

Open vprashar2929 opened 5 months ago

vprashar2929 commented 5 months ago

What happened?

When the latest Kepler image is deployed on a machine, it currently reports the wrong process name in the exported metrics.

Attaching some screenshots for reference:

ps -ef | grep 75577
qemu       75577       1  8 Apr15 ?        01:10:07 /usr/bin/qemu-system-x86_64 -name guest=fedora39,debug-threads=on -S 

Output from pstree command:

pstree -p | grep qemu
           |-qemu-system-x86(75577)-+-{qemu-system-x86}(75605)
           |                        |-{qemu-system-x86}(75617)
           |                        |-{qemu-system-x86}(75618)
           |                        |-{qemu-system-x86}(75619)
           |                        |-{qemu-system-x86}(75620)
           |                        |-{qemu-system-x86}(75622)
           |                        |-{qemu-system-x86}(109718)
           |                        |-{qemu-system-x86}(109719)
           |                        |-{qemu-system-x86}(109720)
           |                        `-{qemu-system-x86}(109721)

Screenshot 2024-04-16 at 1 21 16 PM

What did you expect to happen?

Kepler should report the correct command name in the metrics that it exports.

How can we reproduce it (as minimally and precisely as possible)?

Run Kepler either on Kubernetes or locally with the docker-compose setup available here: https://github.com/sustainable-computing-io/kepler/tree/main/hackdocker-compose

Anything else we need to know?

No response

Kepler image tag

latest

Kubernetes version

```console
$ kubectl version
# paste output here
```

Cloud provider or bare metal

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Kepler deployment config

On Kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```

For standalone:

# put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

dave-tucker commented 4 months ago

I know why this is 🎉 See: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c#L247C3-L247C23

As @vimalk78 found out, from eBPF we record the:

From the perspective of userland, the PID is actually what the kernel calls the TGID - you'll notice that we accidentally on-purpose switch the order of these fields in the definition of the struct: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/bpf/types.go#L49-L50

TL;DR: the comm that we record belongs to the pid (as the kernel sees it, not as userland sees it), so you will indeed get values like CPU 0/KVM.

I think the fix required here is going to be either:

  1. Don't record the comm from eBPF and look it up from procfs instead
  2. Only set the comm if pid == tgid

I'm going to try and verify this theory on my development machine at some point later this week.

vimalk78 commented 1 month ago

@vprashar2929 is this still an issue?

Ref: https://github.com/sustainable-computing-io/kepler/issues/1640

vprashar2929 commented 1 month ago

closing as the issue is addressed and fixed

vprashar2929 commented 3 weeks ago

reopening the issue as the latest Kepler still reports an incorrect process name:

Screenshot 2024-09-13 at 2 25 23 PM

vimalk78 commented 3 weeks ago

what is the expected process name in the above test?

vprashar2929 commented 3 weeks ago

❯ pstree -p | grep qemu
           |-qemu-system-x86(110356)-+-{qemu-system-x86}(110367)
           |                         |-{qemu-system-x86}(110370)
           |                         |-{qemu-system-x86}(110371)
           |                         |-{qemu-system-x86}(110372)
           |                         |-{qemu-system-x86}(110373)
           |                         |-{qemu-system-x86}(110374)
           |                         |-{qemu-system-x86}(110375)
           |                         |-{qemu-system-x86}(110377)
           |                         `-{qemu-system-x86}(2178213)