sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

kepler_node_core_joules_total=0 on RHEL9/x86_64 #1346

Closed (jharriga closed 1 week ago)

jharriga commented 7 months ago

What happened?

Downloaded and installed

https://github.com/sustainable-computing-io/kepler/releases/download/v0.7.9/kepler.rpm.tar.gz

On server running

Ran several CPU-intensive workloads, and the metric remained '0'.

What did you expect to happen?

Expected the metric reading to increase and track system CPU usage.
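One way to make "tracks CPU usage" concrete is to read the counter twice and convert the difference into average power. The following is only a sketch: `average_watts` is a hypothetical helper, not part of Kepler, and it assumes the metric is a cumulative joule counter, as the `_joules_total` suffix suggests.

```python
def average_watts(joules_start, joules_end, seconds):
    """Average power between two readings of a cumulative joule counter.

    Hypothetical helper for illustration. Returns None when the counter
    went backwards, which would indicate a reset (e.g. a service restart).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = joules_end - joules_start
    if delta < 0:
        return None  # counter reset between the two readings
    return delta / seconds  # joules per second = watts
```

A result of 0.0 over an interval in which a CPU-intensive workload was running would correspond to the behaviour reported here.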

How can we reproduce it (as minimally and precisely as possible)?

Download and install the rpm, then start the service and query the metrics endpoint:

root# systemctl start container-kepler --now
root# curl localhost:8888/metrics | grep
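The same check can be scripted instead of grepping by eye. This is a minimal sketch, assuming the endpoint serves the standard Prometheus text format and that the metric of interest is the one named in the issue title. `fetch_metrics` and `sum_metric` are hypothetical helper names, not part of Kepler.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://localhost:8888/metrics"):
    """Download the raw metrics text (URL taken from the curl command above)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def sum_metric(text, name):
    """Sum all samples of `name` in Prometheus text output.

    Returns None if the metric does not appear at all, so that
    "missing" can be told apart from "present but zero".
    """
    total = 0.0
    found = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if not line.startswith(name):
            continue
        rest = line[len(name):]
        if rest.startswith("{"):
            # Drop the label set; the value follows the closing brace.
            rest = rest[rest.rindex("}") + 1:]
        elif not rest[:1].isspace():
            continue  # a longer metric name that merely shares the prefix
        fields = rest.split()
        if not fields:
            continue
        total += float(fields[0])  # an optional timestamp may follow
        found = True
    return total if found else None
```

For example, `sum_metric(fetch_metrics(), "kepler_node_core_joules_total")` would return `0.0` in the situation described in this report, and `None` if the metric were not exported at all.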

Anything else we need to know?

No response

Kepler image tag

v0.7.9

Kubernetes version

NONE

Cloud provider or bare metal

bare metal

OS version

```console
# On Linux:
$ cat /etc/os-release
Red Hat Enterprise Linux 9.4 (Plow)
$ uname -a
Linux perf-intel-28.perf.eng.bos2.dc.redhat.com 5.14.0-417.kpq1.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb 2 14:05:04 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

# rpm --version
RPM version 4.16.1.3

Kepler deployment config

For standalone:

# put your Kepler command argument here
root# systemctl start container-kepler --now
root# curl localhost:8888/metrics | grep

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

rootfs commented 6 months ago

@jharriga can you double check if it is kepler_node_core_joules_total or kepler_node_package_joules_total?

Currently, the Ampere xgene hwmon only reports the CPU and I/O power (per doc here); we cannot get DRAM power. So, to align with the RAPL reporting, Kepler only reports kepler_node_core_total (per code here).
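On an x86 host, the RAPL counters mentioned above can be cross-checked independently of Kepler through the Linux powercap sysfs interface. The sketch below assumes the `intel_rapl` driver is loaded and exposes zones under `/sys/class/powercap`; `read_rapl_zones` is a hypothetical helper, and this is not how Kepler itself reads RAPL. Reading `energy_uj` typically requires root on recent kernels.

```python
from pathlib import Path


def read_rapl_zones(base="/sys/class/powercap"):
    """Return {zone_dir: (zone_name, joules)} for each readable RAPL zone.

    Zone directories look like intel-rapl:0 (a package) or
    intel-rapl:0:0 (a subzone such as "core"). Zones that cannot be
    read (permissions, missing files) are skipped.
    """
    zones = {}
    for zone in sorted(Path(base).glob("intel-rapl:*")):
        try:
            name = (zone / "name").read_text().strip()
            microjoules = int((zone / "energy_uj").read_text())
        except (OSError, ValueError):
            continue
        zones[zone.name] = (name, microjoules / 1_000_000)  # uJ -> J
    return zones
```

If the "core" zone here increases under load while the exported metric stays at zero, the problem is more likely in the exporter than in the hardware counters.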

jharriga commented 6 months ago

This was originally reported on x86. Running v0.7.10 on x86, I do see that the metric kepler_node_core_joules_total has a value:

root# curl localhost:8888/metrics | grep kepler_node_core_joules_total

As for ARM, on Ampere server running v0.7.10 I see:

Both the kepler_node_core_joules_total and kepler_node_package_joules_total metrics do have values. This doesn't seem to align with what you expected in the previous comment.

At any rate, I think this issue can be closed, since the originally reported problem on x86 appears to have been resolved.