sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0

missing ebpf readings on realtime (RT) 5.14+ kernels #1175

Closed novacain1 closed 8 months ago

novacain1 commented 8 months ago

What happened?

I am running kepler on an OCP 4.14.7 setup that runs kernel 5.14.0-284.45.1.rt14.330.el9_2.x86_64. I'm using the community Operator for installation and configuration of Kepler.

$ oc exec -ti -n openshift-kepler-operator daemonset/kepler-exporter-ds -- bash -c "curl localhost:9103/metrics |grep kepler_container_bpf_cpu |sort -k 2 -g "

kepler_container_bpf_cpu_time_us_total{container_id="09a3e3813a05e48588919afb80faddc06698f42b080183af4f09ee92a9cd598d",container_name="kube-controller-manager-operator",container_namespace="openshift-kube-controller-manager-operator",pod_name="kube-controller-manager-operator-54bccb5847-jd7lp"} 0
kepler_container_bpf_cpu_time_us_total{container_id="0c3f742c71798a362c2fa5d7ab106f3d6586812390fd4769999fd50012c2af16",container_name="registry-server",container_namespace="openshift-marketplace",pod_name="community-operators-kepler-xjzsl"} 0
kepler_container_bpf_cpu_time_us_total{container_id="0cbb5fe1353262f3da1c4d157cb0a4aa5a744df5f5bcac5b60cfb2bc398bd651",container_name="prometheus",container_namespace="openshift-monitoring",pod_name="prometheus-k8s-0"} 0
kepler_container_bpf_cpu_time_us_total{container_id="0ec37fe7658d818163d135054fe0dc9db8bae1fbb39588cc941813250e936bff",container_name="machine-config-daemon",container_namespace="openshift-machine-config-operator",pod_name="machine-config-daemon-97lll"} 0
$ oc exec -ti -n openshift-kepler-operator daemonset/kepler-exporter-ds -- bash -c "curl localhost:9103/metrics |grep kepler_container_cpu |sort -k 2 -g "

kepler_container_cpu_cycles_total{container_id="0fb4bf318569b310a3d99c56b8bc0c6010c86fe093da40f653153f02d17fe38f",container_name="pull",container_namespace="openshift-marketplace",pod_name="593052ad60aa4ab06b18439bd06e9282e44e0745ad0c7a6a14e2bbba4ansqqh"} 0
kepler_container_cpu_cycles_total{container_id="0feba45ac0ccb903fb58741d6ec6c91f6f8d50596a233315c7f169f815d4ea37",container_name="local-storage-operator",container_namespace="openshift-local-storage",pod_name="local-storage-operator-7c48958ffb-d4mxm"} 0
kepler_container_cpu_cycles_total{container_id="10a01146f00506f918d0db6209ed754fa2453760493eba4a85ef2d6e5a7be733",container_name="kube-rbac-proxy-main",container_namespace="openshift-monitoring",pod_name="openshift-state-metrics-7f8ff767bd-zgmkf"} 0
kepler_container_cpu_cycles_total{container_id="10f7b1b479db87ef6a045efd91fbacfcc46b3375668b2c917ab463fac14e9a6f",container_name="controller-manager",container_namespace="openshift-controller-manager",pod_name="controller-manager-c9b5ff7cd-s4vjw"} 0

perf stat, by contrast, does report non-zero counters:

perf stat -e cycles,cache-misses sleep 1

 Performance counter stats for 'sleep 1':

           1355729      cycles                                                                
             16058      cache-misses                                                          

       1.001605151 seconds time elapsed

       0.000000000 seconds user
       0.001651000 seconds sys

https://github.com/sustainable-computing-io/kepler/issues/959 was opened on OpenShift 4.12 (which ships an older 4.18 kernel).
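To tell the two kernel families apart when triaging reports like this one, a small helper can classify a `uname -r` string. This is a minimal sketch, not something Kepler does: the function names are hypothetical, and the `rt<N>` token check is a heuristic assumption based on the release string quoted above (`5.14.0-284.45.1.rt14.330.el9_2.x86_64`). The 5.14 threshold comes only from the issue title.

```python
import re


def parse_kernel_release(release):
    """Split a `uname -r` string into a (major, minor) tuple and a realtime flag."""
    m = re.match(r"^(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(m.group(1)), int(m.group(2)))
    # Heuristic (assumption): realtime builds carry an "rt<N>" token delimited
    # by '.', '-' or '+', as in "...45.1.rt14.330.el9_2.x86_64".
    is_rt = re.search(r"(?:^|[.\-+])rt\d*(?:[.\-+]|$)", release) is not None
    return version, is_rt


def matches_issue_scope(release, minimum=(5, 14)):
    """True for realtime kernels at or above `minimum` (5.14 per the issue title)."""
    version, is_rt = parse_kernel_release(release)
    return is_rt and version >= minimum


if __name__ == "__main__":
    for rel in ("5.14.0-284.45.1.rt14.330.el9_2.x86_64", "4.18.0-372.9.1.el8.x86_64"):
        print(rel, matches_issue_scope(rel))
```

The second release string in the demo is an illustrative non-RT 4.18 example, not one taken from this thread.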

Happy to try some things here, just let me know.

What did you expect to happen?

bpf stats should be non-zero on realtime kernels.
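This expectation can be checked mechanically instead of by reading the `curl | grep` output. The sketch below parses Prometheus text-format output and counts how many samples per metric are zero. The function names and the default prefix are assumptions chosen for illustration; only the metric names themselves come from the output above.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def summarize_zero_samples(metrics_text, prefix="kepler_container_bpf_"):
    """Return {metric_name: (zero_count, total_count)} for names starting with prefix."""
    counts = defaultdict(lambda: [0, 0])
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or not m.group(1).startswith(prefix):
            continue
        counts[m.group(1)][1] += 1
        if float(m.group(3)) == 0.0:
            counts[m.group(1)][0] += 1
    return {name: tuple(pair) for name, pair in counts.items()}


def all_zero(summary):
    """True when at least one sample was seen and every sample is zero."""
    return bool(summary) and all(zero == total for zero, total in summary.values())
```

Feeding it the body of `curl localhost:9103/metrics` and calling `all_zero(...)` on the result would flag the situation reported here, where every `kepler_container_bpf_cpu_time_us_total` sample is 0.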

How can we reproduce it (as minimally and precisely as possible)?

Reproduced on a baremetal OpenShift 4.14.7 cluster.

Anything else we need to know?

No response

Kepler image tag

Operator: kepler-operator.v0.9.2 Image: quay.io/sustainable_computing_io/kepler:release-0.6.1

Kubernetes version

```console
$ oc version
Server Version: 4.14.7
```

Cloud provider or bare metal

baremetal

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Kepler deployment config

For on kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```

For standalone:

# put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

rootfs commented 8 months ago

@novacain1 thanks for gathering this info. Would you please try the 0.10 Kepler operator with the 0.7.2 Kepler image? Also cc @sthaha @vprashar2929

novacain1 commented 8 months ago

Looks to be similar behavior @rootfs :

make deploy "OPERATOR_IMG=quay.io/sustainable_computing_io/kepler-operator:0.10.0" KEPLER_IMG="quay.io/sustainable_computing_io/kepler:release-0.7.2"

Install sample kepler CRD in config/samples:

$ oc describe pod -n kepler-operator-system kepler-operator-controller-6c8d966f5c-mp62s | grep Image
    Image:         quay.io/sustainable_computing_io/kepler-operator:0.10.0
$ oc describe daemonsets.apps kepler | grep Image
    Image:      quay.io/sustainable_computing_io/kepler:release-0.7.2

Evaluation:

$ oc exec -ti -n kepler-operator daemonset/kepler -- bash -c "curl localhost:9103/metrics |grep kepler_container_bpf_cpu |sort -k 2 -g "
kepler_container_bpf_cpu_time_us_total{container_id="3efbb44685f97dbcfef9ff86e5ec94c9c60fb606091ceea87cec7b73edb83bb5",container_name="oauth-openshift",container_namespace="openshift-authentication",pod_name="oauth-openshift-67545f89f7-cgjz6"} 0
kepler_container_bpf_cpu_time_us_total{container_id="3fafb0f3315056d9d9820a6504990a7c75195162ada49718cfe780aa2deddde7",container_name="oauth-apiserver",container_namespace="openshift-oauth-apiserver",pod_name="apiserver-5d9d4c674c-9mmj7"} 0
$ oc exec -ti -n kepler-operator daemonset/kepler -- bash -c "curl localhost:9103/metrics |grep kepler_container_cpu |sort -k 2 -g "
kepler_container_cpu_instructions_total{container_id="3efbb44685f97dbcfef9ff86e5ec94c9c60fb606091ceea87cec7b73edb83bb5",container_name="oauth-openshift",container_namespace="openshift-authentication",pod_name="oauth-openshift-67545f89f7-cgjz6"} 0
kepler_container_cpu_instructions_total{container_id="3fafb0f3315056d9d9820a6504990a7c75195162ada49718cfe780aa2deddde7",container_name="oauth-apiserver",container_namespace="openshift-oauth-apiserver",pod_name="apiserver-5d9d4c674c-9mmj7"} 0

rootfs commented 8 months ago

@sthaha @vprashar2929

vprashar2929 commented 8 months ago

@novacain1 I can see the metrics with values available on the OpenShift cluster.

[Screenshot 2024-01-10 at 11 46 52 PM] [Screenshot 2024-01-10 at 11 46 46 PM]

Can you enable user workload monitoring (UWM) and then query the metrics from the OpenShift console?
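The same check can be scripted against the Prometheus HTTP API instead of the console. This is a rough sketch under stated assumptions: the base URL is a placeholder, authentication (which an OpenShift monitoring endpoint normally requires) is not shown, and the helper names are hypothetical. The `/api/v1/query` endpoint and the instant-vector response shape are standard Prometheus.

```python
import json
from urllib.parse import urlencode

# Example PromQL: total eBPF CPU time per namespace (labels taken from the output above).
EXAMPLE_QUERY = "sum by (container_namespace) (kepler_container_bpf_cpu_time_us_total)"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def nonzero_series(response_text):
    """Return the label sets of series whose instant value is non-zero."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    found = []
    for series in body["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) != 0.0:
            found.append(series["metric"])
    return found
```

An empty list from `nonzero_series` for the query above would match what this issue describes. A non-empty list would match the screenshots, where values are present.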

novacain1 commented 8 months ago

UWM is enabled. I am forwarding metrics to a centralized Prometheus instance via Observatorium.

image

image

Here are hwmon metrics for the same cluster (interceptor) which has two nodes, showing data:

image

novacain1 commented 8 months ago

It looks like cgroup metrics were removed in recent releases, and Kepler now relies only on eBPF metrics. PR #1185 fixes collection of the eBPF metrics; I am testing it in my lab. However, the idle power calculations don't look correct. I'll open another issue.

I suggest leaving this open until PR #1185 merges into the mainline code, since without it I wasn't even seeing the metrics being collected by the Kepler exporter. Many thanks to @rootfs for his help here.