Closed novacain1 closed 8 months ago
@novacain1 thanks for getting this info. Would you please try the 0.10 kepler operator with 0.7.2 kepler image? also cc @sthaha @vprashar2929
Looks to be similar behavior @rootfs :
make deploy "OPERATOR_IMG=quay.io/sustainable_computing_io/kepler-operator:0.10.0" KEPLER_IMG="quay.io/sustainable_computing_io/kepler:release-0.7.2"
Install sample kepler CRD in config/samples:
$ oc describe pod -n kepler-operator-system kepler-operator-controller-6c8d966f5c-mp62s | grep Image
Image: quay.io/sustainable_computing_io/kepler-operator:0.10.0
$ oc describe daemonsets.apps kepler | grep Image
Image: quay.io/sustainable_computing_io/kepler:release-0.7.2
Evaluation:
$ oc exec -ti -n kepler-operator daemonset/kepler -- bash -c "curl localhost:9103/metrics |grep kepler_container_bpf_cpu |sort -k 2 -g "
kepler_container_bpf_cpu_time_us_total{container_id="3efbb44685f97dbcfef9ff86e5ec94c9c60fb606091ceea87cec7b73edb83bb5",container_name="oauth-openshift",container_namespace="openshift-authentication",pod_name="oauth-openshift-67545f89f7-cgjz6"} 0
kepler_container_bpf_cpu_time_us_total{container_id="3fafb0f3315056d9d9820a6504990a7c75195162ada49718cfe780aa2deddde7",container_name="oauth-apiserver",container_namespace="openshift-oauth-apiserver",pod_name="apiserver-5d9d4c674c-9mmj7"} 0
$ oc exec -ti -n kepler-operator daemonset/kepler -- bash -c "curl localhost:9103/metrics |grep kepler_container_cpu |sort -k 2 -g "
kepler_container_cpu_instructions_total{container_id="3efbb44685f97dbcfef9ff86e5ec94c9c60fb606091ceea87cec7b73edb83bb5",container_name="oauth-openshift",container_namespace="openshift-authentication",pod_name="oauth-openshift-67545f89f7-cgjz6"} 0
kepler_container_cpu_instructions_total{container_id="3fafb0f3315056d9d9820a6504990a7c75195162ada49718cfe780aa2deddde7",container_name="oauth-apiserver",container_namespace="openshift-oauth-apiserver",pod_name="apiserver-5d9d4c674c-9mmj7"} 0
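For anyone else triaging this, here is a rough sketch of how the "everything is zero" observation above could be quantified from a saved scrape, rather than eyeballing the `grep | sort` output. It is not part of Kepler. The function name `summarize_zero_series` and the default `kepler_container_` prefix are my own assumptions for illustration, and the parser only handles the plain Prometheus text exposition lines shown above.

```python
import re
from collections import defaultdict

# One Prometheus text-format sample: metric name, optional {labels}, then the value.
# Assumption: one sample per line, as in the curl output above.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(\S+)')


def summarize_zero_series(metrics_text, prefix="kepler_container_"):
    """Return {metric_name: (zero_count, total_count)} for samples matching prefix.

    `prefix` is a hypothetical default chosen to match the metrics in this issue.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        name, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue  # Not a numeric sample; ignore it.
        counts[name][1] += 1
        if value == 0.0:
            counts[name][0] += 1
    return {name: (zero, total) for name, (zero, total) in counts.items()}
```

To use it, save the output of `curl localhost:9103/metrics` to a file and pass the file contents to `summarize_zero_series`. A metric where the zero count equals the total count on every scrape is a sign the exporter is not collecting that counter at all.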
@sthaha @vprashar2929
@novacain1 I can see the metrics with values available on the OpenShift cluster.
Can you enable UWM and then query the metrics from the OpenShift console?
UWM is enabled. I am forwarding metrics to a centralized prom instance via Observatorium.
Here are hwmon metrics for the same cluster (interceptor) which has two nodes, showing data:
It looks like the cgroup metrics were removed in recent releases, so Kepler now relies only on eBPF metrics. PR #1185 restores the eBPF metrics in my lab testing. However, the idle power calculations don't look correct, so I'll open another issue for that.
Suggest leaving this open until PR #1185 merges into the mainline code, as without this I wasn't even seeing the metrics being collected by the kepler exporter. Many thanks to @rootfs for his help here.
What happened?
I am running kepler on an OCP 4.14.7 setup that runs kernel 5.14.0-284.45.1.rt14.330.el9_2.x86_64. I'm using the community Operator for installation and configuration of Kepler.
Perf stat shows results:
https://github.com/sustainable-computing-io/kepler/issues/959 was opened on OpenShift 4.12 (which contains an older 4.18 kernel).
Happy to try some things here, just let me know.
What did you expect to happen?
bpf stats should be non-zero on realtime kernels.
How can we reproduce it (as minimally and precisely as possible)?
Reproduced on a baremetal OpenShift 4.14.7 cluster.
Anything else we need to know?
No response