sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Unable to access Kepler metrics from prometheus #767

Closed: edoblette closed this issue 10 months ago

edoblette commented 1 year ago

What happened?

Unable to export metrics to Prometheus: Kepler appears as "Dropped" on Prometheus's /service-discovery page. I'm reporting this problem as a follow-up to the earlier issue "How to select the tag of kepler-helm-chart to install Kepler?" about the Kepler Helm deployment, following the initiative of @rootfs and @LAI-chuchi.

However, metrics are available on port 9102 of the kepler pod at http://:9102/metrics.
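For reference, a minimal way to spot-check the exporter from outside the cluster (the label selector below comes from the chart's ServiceMonitor; the pod name placeholder has to be filled in):

```console
kubectl -n kepler get pods -l app.kubernetes.io/name=kepler
kubectl -n kepler port-forward <kepler-pod-name> 9102:9102 &
curl -s http://localhost:9102/metrics | grep kepler_ | head
```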

I'm using Kind for my K8s cluster, with Prometheus and Grafana already deployed. I also use Cilium without any problems.

What did you expect to happen?

Get Kepler metrics enabled in my Prometheus query dashboard.

How can we reproduce it (as minimally and precisely as possible)?

Linux 22.04, kernel 5.5.0-050500-generic, kind version 0.18.0

Anything else we need to know?

Log from kepler pod:

```
I0706 13:50:55.221651 1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0706 13:50:55.416231 1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0706 13:50:55.714345 1 exporter.go:151] Kepler running on version: 364c44f
I0706 13:50:55.714388 1 config.go:212] using gCgroup ID in the BPF program: true
I0706 13:50:55.714445 1 config.go:214] kernel version: 5.5
I0706 13:50:55.714484 1 exporter.go:171] EnabledBPFBatchDelete: true
I0706 13:50:55.714602 1 power.go:53] use sysfs to obtain power
I0706 13:51:03.936389 1 exporter.go:184] Initializing the GPU collector
I0706 13:51:03.937016 1 watcher.go:67] Using in cluster k8s config
I0706 13:51:30.137397 1 bcc_attacher.go:171] Successfully load eBPF module with option: [-DMAP_SIZE=10240 -DNUM_CPUS=16]
I0706 13:51:30.717792 1 exporter.go:228] Started Kepler in 35.003475726s
I0706 13:51:33.720063 1 container_hc_collector.go:130] failed to get bpf table elemets, err: failed to batch get: invalid argument
I0706 13:51:33.728652 1 container_hc_collector.go:211] resetting EnabledBPFBatchDelete to false
I0706 13:51:36.914588 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914707 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914739 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914800 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914850 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914881 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914949 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.914985 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915014 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915042 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915070 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915103 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915132 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915160 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915188 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915216 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915243 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915272 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915300 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915328 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915356 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915383 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915411 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
I0706 13:51:36.915442 1 container_hc_collector.go:164] could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
[...] (goes on forever)
```

The metrics from my pod (http://172.18.0.2:9102/metrics):

```txt
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.4665e-05
go_gc_duration_seconds{quantile="0.25"} 9.1751e-05
go_gc_duration_seconds{quantile="0.5"} 0.000116279
go_gc_duration_seconds{quantile="0.75"} 0.000145458
go_gc_duration_seconds{quantile="1"} 0.100348473
go_gc_duration_seconds_sum 7.546713807
go_gc_duration_seconds_count 919
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 20
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.18.1"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 7.207032e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.389494576e+09
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.537498e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 5.0197247e+07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 5.604088e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 7.207032e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.864896e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 8.470528e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 49085
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 4.243456e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 1.5335424e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6886531940142617e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 5.0246332e+07
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 19200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 31200
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 267920
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 375360
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 9.534512e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 3.28271e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.441792e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.441792e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.7608072e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 21
# HELP kepler_container_package_joules_total Aggregated RAPL value in package (socket) in joules
# TYPE kepler_container_package_joules_total counter
kepler_container_package_joules_total{command="",container_id="",container_name="upf",container_namespace="core5g-sys",mode="dynamic",pod_name="core5g-free5gc-upf-upf-6cb555dd9-fcl68"} 0
kepler_container_package_joules_total{command="",container_id="",container_name="upf",container_namespace="core5g-sys",mode="idle",pod_name="core5g-free5gc-upf-upf-6cb555dd9-fcl68"} 1113.33
kepler_container_package_joules_total{command="",container_id="0148e9f8e423c9eee15a211f75f7fda26e946d3a803d145c0ad51192685990e1",container_name="amf",container_namespace="core5g-sys",mode="dynamic",pod_name="core5g-free5gc-amf-amf-d67fc97bf-z5qwf"} 0
kepler_container_package_joules_total{command="",container_id="0148e9f8e423c9eee15a211f75f7fda26e946d3a803d145c0ad51192685990e1",container_name="amf",container_namespace="core5g-sys",mode="idle",pod_name="core5g-free5gc-amf-amf-d67fc97bf-z5qwf"} 1113.33
kepler_container_package_joules_total{command="",container_id="0164b39138b37d4d36334522c0f862cf5ed610a0cb162371bfb286a9f891497c",container_name="install-cni-binaries",container_namespace="kube-system",mode="dynamic",pod_name="cilium-kjj2f"} 0
kepler_container_package_joules_total{command="",container_id="0164b39138b37d4d36334522c0f862cf5ed610a0cb162371bfb286a9f891497c",container_name="install-cni-binaries",container_namespace="kube-system",mode="idle",pod_name="cilium-kjj2f"} 1113.33
kepler_container_package_joules_total{command="",container_id="0267587b6b0919ac7297dcedf9378831c8d0538e0d2686608b3f1d92ec617151",container_name="wait-mongo",container_namespace="core5g-sys",mode="dynamic",pod_name="core5g-free5gc-nrf-nrf-f89c6b99b-ntdkj"} 0
kepler_container_package_joules_total{command="",container_id="0267587b6b0919ac7297dcedf9378831c8d0538e0d2686608b3f1d92ec617151",container_name="wait-mongo",container_namespace="core5g-sys",mode="idle",pod_name="core5g-free5gc-nrf-nrf-f89c6b99b-ntdkj"} 1113.33
kepler_container_package_joules_total{command="",container_id="0ab33bc2b900f87b20a41c4f848f29314ba61eda210ba66bce841734136038ef",container_name="prometheus-server-configmap-reload",container_namespace="monitoring",mode="dynamic",pod_name="prometheus-server- [...]
```

Kepler image tag:

kepler-0.4.3 (Helm chart) / release-0.5.1 (app version)

Kubernetes version

```console
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:47:38Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-30T06:34:50Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
```

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux yd-CZC34914Z1 5.5.0-050500-generic #202001262030 SMP Mon Jan 27 01:33:36 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```

Kepler deployment config

```console
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm install kepler kepler/kepler --namespace kepler --create-namespace --values config/kepler/kepler-value.yaml
```
```yaml
# value.yaml
# -- Replaces the name of the chart in the Chart.yaml file
nameOverride: ""
# -- Replaces the generated name
fullnameOverride: ""

image:
  # -- Repository to pull the image from
  repository: "quay.io/sustainable_computing_io/kepler"
  # -- Image tag, if empty it will get it from the chart's appVersion
  tag: ""
  # -- Pull policy
  pullPolicy: Always

# -- Secret name for pulling images from private repository
imagePullSecrets: []
# -- Additional DaemonSet annotations
annotations: {}
# -- Additional pod annotations
podAnnotations: {}
# -- Privileges and access control settings for a Pod (all containers in a pod)
podSecurityContext: {}
  # fsGroup: 2000
# -- Privileges and access control settings for a container
securityContext:
  privileged: true
# -- Node selection constraint
nodeSelector: {}
# -- Toleration for taints
tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
# -- Affinity rules
affinity: {}
# -- CPU/MEM resources
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits:
    cpu: 100m
    memory: 200Mi
# -- Extra environment variables
extraEnvVars:
  KEPLER_LOG_LEVEL: "1"
  ENABLE_GPU: "true"
  ENABLE_EBPF_CGROUPID: "true"
  EXPOSE_IRQ_COUNTER_METRICS: "true"
  EXPOSE_KUBELET_METRICS: "true"
  ENABLE_PROCESS_METRICS: "true"
  CPU_ARCH_OVERRIDE: ""
  CGROUP_METRICS: "*"

service:
  annotations: {}
  type: ClusterIP
  port: 9102

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

serviceMonitor:
  enabled: true
  namespace: ""
  interval: 1m
  scrapeTimeout: 10s
  labels: {}
```

Related plugins (CNI, CSI, ...) and versions (if applicable)

MULTUS CNI
rootfs commented 1 year ago

The message "could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory" keeps showing up. It is possible the key has already been deleted but was not tracked correctly.

edoblette commented 1 year ago

The metrics seem to be correctly reported via port 9102/metrics, so I think Kepler is working fine, but I don't know why my service is dropped by Prometheus. Any ideas? @rootfs

rootfs commented 1 year ago

@edoblette how did you install Prometheus? Did you use the kube-prometheus operator or another mechanism?

rootfs commented 1 year ago

@edoblette if you use kube-prometheus, can you see kepler metrics in the prometheus query? Can you check the output of the following query?

kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'

Do you happen to have any networkpolicy in place that blocks scraping? Can you post the output of

kubectl get networkpolicy,servicemonitor -A
edoblette commented 1 year ago

@edoblette how did you install Prometheus? Did you use the kube-prometheus operator or another mechanism?

I've used :

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts > /dev/null
helm repo update > /dev/null
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace > /dev/null #--wait
kubectl expose service -n monitoring prometheus-server --type=NodePort --target-port=9090 --name=prometheus-server-ext
#kubectl get service -A 

#GET PROMETHEUS URL
PROM_URL=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }');
PROM_PORT=$(kubectl get -n monitoring -o jsonpath="{.spec.ports[0].nodePort}" services prometheus-server-ext);
echo "\n 🔥 Prometheus URL 🔥 \n http://${PROM_URL}:${PROM_PORT}";
edoblette commented 1 year ago

@edoblette if you use kube-prometheus, can you see kepler metrics in the prometheus query? Can you check the output of the following query?

kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'

Do you happen to have any networkpolicy in place that blocks scraping? Can you post the output of

kubectl get networkpolicy,servicemonitor -A
$ kubectl get networkpolicy,servicemonitor -A
NAMESPACE   NAME                                                              AGE
kepler      servicemonitor.monitoring.coreos.com/kepler-prometheus-exporter   92m
rootfs commented 1 year ago

@edoblette how did you install Prometheus? Did you use the kube-prometheus operator or another mechanism?

I've used :

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts > /dev/null
helm repo update > /dev/null
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace > /dev/null #--wait
kubectl expose service -n monitoring prometheus-server --type=NodePort --target-port=9090 --name=prometheus-server-ext
#kubectl get service -A 

#GET PROMETHEUS URL
PROM_URL=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }');
PROM_PORT=$(kubectl get -n monitoring -o jsonpath="{.spec.ports[0].nodePort}" services prometheus-server-ext);
echo "\n 🔥 Prometheus URL 🔥 \n http://${PROM_URL}:${PROM_PORT}";

The monitoring namespace is the same one the kube-prometheus operator uses. Can you check whether a Prometheus query returns any Kepler metrics?

kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'
juangascon commented 1 year ago

Hello. I am experiencing the same issue with the Helm chart, not only in kind but also on GCP. I do not know enough about Helm charts to resolve the issue, but I have found through troubleshooting that the installation goes well when installing manually: build the manifest with the option OPTS="PROMETHEUS_DEPLOY" and deploy with kubectl. I do not know what the chart is doing differently from the manifests.
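For anyone who wants to try the same manual path, roughly what I mean is the following (the make target and output path are taken from the Kepler repo's build conventions, so double-check them against the docs):

git clone https://github.com/sustainable-computing-io/kepler
cd kepler
make build-manifest OPTS="PROMETHEUS_DEPLOY"
kubectl apply -f _output/generated-manifest/deployment.yaml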

rootfs commented 1 year ago

thank you @juangascon for the pointer!

@edoblette Can you get the service monitor yaml from your setup?

kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml

If it helps, can you use this yaml I just created with OPTS="PROMETHEUS_DEPLOY"?

kubectl apply -f https://gist.githubusercontent.com/rootfs/7ee3098af59b291964968e05536947dc/raw/1e0efa3be74e061f092c902751fc8989bf8fcde4/kepler-prometheus.yaml
edoblette commented 1 year ago

thank you @juangascon for the pointer!

@edoblette

Can you get the service monitor yaml from your setup?


kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml

If it helps, can you use this yaml I just created with OPTS="PROMETHEUS_DEPLOY"?


kubectl apply -f https://gist.githubusercontent.com/rootfs/7ee3098af59b291964968e05536947dc/raw/1e0efa3be74e061f092c902751fc8989bf8fcde4/kepler-prometheus.yaml

Thanks, I will try soon!

edoblette commented 1 year ago

When I execute kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml I get:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kepler
    meta.helm.sh/release-namespace: kepler
  creationTimestamp: "2023-07-06T13:50:51Z"
  generation: 1
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kepler
    app.kubernetes.io/version: release-0.5.1
    helm.sh/chart: kepler-0.4.3
  name: kepler-prometheus-exporter
  namespace: kepler
  resourceVersion: "14823"
  uid: a20787e5-6819-4147-baff-6e28d60bdbe6
spec:
  endpoints:
  - interval: 1m
    path: /metrics
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    scheme: http
    scrapeTimeout: 10s
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kepler
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler
rootfs commented 1 year ago

Odd, that looks right to me. Not sure why Prometheus doesn't see the Kepler exporter.

rootfs commented 1 year ago

Here is my local Prometheus service-discovery page (screenshot attached).

edoblette commented 1 year ago

I'm out of the office today; I will investigate further next week with the yaml file you gave me.

clin4 commented 1 year ago

For me, it sounds like your ServiceMonitor is missing a label, which is why Prometheus cannot discover it. When you install Kepler via Helm, besides enabling the service monitor, you also need to provide a label. Since I am using Terraform, the code looks like:

resource "helm_release" "kepler" {
  name       = "kepler"
  repository = "https://sustainable-computing-io.github.io/kepler-helm-chart"
  chart      = "kepler"
  namespace = "kepler"

  create_namespace = true

  set {
    name = "serviceMonitor.enabled"
    value = true
  }

  set {
    name = "serviceMonitor.labels.release"
    value = "kube-prometheus-stack"
  }
}
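
For reference, the plain Helm equivalent of the Terraform above should look roughly like this (the label value must match whatever name your kube-prometheus-stack release was installed under; "kube-prometheus-stack" is just my release name):

helm install kepler kepler/kepler \
  --namespace kepler --create-namespace \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.labels.release=kube-prometheus-stack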

In the end, you can find this label in the ServiceMonitor:

k get servicemonitor/kepler-prometheus-exporter -o yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kepler
    meta.helm.sh/release-namespace: kube-prometheus-stack
  creationTimestamp: "2023-06-30T00:11:37Z"
  generation: 1
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kepler
    app.kubernetes.io/version: release-0.5
    helm.sh/chart: kepler-0.4.2
    release: kube-prometheus-stack
  name: kepler-prometheus-exporter
  namespace: kube-prometheus-stack
mcalman commented 11 months ago

I came across the same issue when using the kube-prometheus-stack helm chart. As @clin4 mentioned, in my case, it was due to a missing label. By default, the helm chart adds a service monitor selector (preventing Prometheus from matching all service monitors) unless you disable this configuration: https://github.com/prometheus-community/helm-charts/blob/f36d97ed314926a8a735a4d97f37af756ebc0bcb/charts/kube-prometheus-stack/values.yaml#L3021

findpaths commented 10 months ago

I came across the same issue when using the kube-prometheus-stack helm chart. As @clin4 mentioned, in my case, it was due to a missing label. By default, the helm chart adds a service monitor selector (preventing Prometheus from matching all service monitors) unless you disable this configuration: https://github.com/prometheus-community/helm-charts/blob/f36d97ed314926a8a735a4d97f37af756ebc0bcb/charts/kube-prometheus-stack/values.yaml#L3021

This is exactly it - I just resolved this in my setup. The Helm install of kube-prometheus-stack will only pick up ServiceMonitors that have a "release" label matching the Helm release name ("prometheus" in my case).

To fix an already installed Kepler, edit the ServiceMonitor definition and add the label "release: " (set to your Prometheus release name), or set serviceMonitorSelectorNilUsesHelmValues to false as part of the helm install of kube-prometheus-stack; a sketch of the latter is below.
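A minimal sketch of that second option, assuming kube-prometheus-stack is installed or upgraded with a values file (the file name is hypothetical):

# prometheus-values.yaml (hypothetical name)
prometheus:
  prometheusSpec:
    # let Prometheus select ServiceMonitors regardless of their release label
    serviceMonitorSelectorNilUsesHelmValues: false

applied with something like helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring -f prometheus-values.yaml.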

marceloamaral commented 10 months ago

@edoblette was the issue solved? Can we close it?

rootfs commented 10 months ago

closing for now. Reopen if update available.

sc20tcl commented 4 months ago

Hello, I am having the same errors as @edoblette: although metrics are available on port 9102 of the Kepler pod at localhost:9102/metrics, my Prometheus cannot access them.

I am running AKS for my k8s cluster and have deployed Kepler and Prometheus as shown in this thread:

helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm install kepler kepler/kepler --namespace kepler --create-namespace --values value.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts 
helm repo update 
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace 

I have also followed the advice of @clin4, changing my value.yaml file so that the output of the service monitor is:

kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml    

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kepler
    meta.helm.sh/release-namespace: kepler
  creationTimestamp: "2024-03-16T23:29:26Z"
  generation: 1
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kepler
    app.kubernetes.io/version: release-0.7.2
    helm.sh/chart: kepler-0.5.5
    release: kepler
  name: kepler-prometheus-exporter
  namespace: kepler

I made the assumption that the release label should match the namespace where Kepler is installed, so I changed the release to "kepler" instead of "kube-prometheus-stack". Please tell me if this is wrong, as it was a very uninformed assumption.
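One way to check which release label the operator actually expects is to inspect the serviceMonitorSelector on the Prometheus custom resource (the resource name is unknown here, so list it first); if no Prometheus resource exists at all, the installed chart is not running the Prometheus Operator and ServiceMonitors are ignored entirely:

kubectl get prometheus -A
kubectl -n monitoring get prometheus <prometheus-name> -o jsonpath='{.spec.serviceMonitorSelector}'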

Currently the output of

kubectl exec -ti -n monitoring prometheus-server-6b56bf746f-8vzgd -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'

is:

Connecting to localhost:9090 (127.0.0.1:9090)
writing to stdout
-                    100% |*******************************************************************************************************************************************************************************|    63  0:00:00 ETA
written to stdout
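
For what it's worth, a response that small usually indicates an empty result set, so a next step could be to check whether Prometheus has registered any Kepler target at all (reusing the pod name from the command above):

kubectl exec -ti -n monitoring prometheus-server-6b56bf746f-8vzgd -- sh -c 'wget -qO- "localhost:9090/api/v1/targets?state=active"' | grep -c kepler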

Something else I noticed is that my Prometheus dashboard appears different from the one shown by @rootfs. (Not sure if this is relevant, but I thought I would point it out either way.)

Screenshot 2024-03-17 at 00 14 16

Any help you can provide would be greatly appreciated - this is something I have been stuck on for weeks and just can’t seem to find a solution.

marvin-steinke commented 4 months ago

@sc20tcl Having the same issues. Can you give an update if you make any progress? ^^

marvin-steinke commented 4 months ago

@sc20tcl Ok, I got it. We need to install the kube-prometheus-stack Helm chart, enable the service monitor, and add the release label of the Prometheus release:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --wait
helm install kepler kepler/kepler \
    --namespace kepler \
    --create-namespace \
    --set serviceMonitor.enabled=true \
    --set serviceMonitor.labels.release=prometheus
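
A quick way to verify the scrape afterwards could be a direct query against the Prometheus pod (the pod name below follows kube-prometheus-stack's naming for a release called "prometheus" and is an assumption, so check it with kubectl get pods -n monitoring first):

kubectl -n monitoring exec -ti prometheus-prometheus-kube-prometheus-prometheus-0 -- wget -qO- 'http://localhost:9090/api/v1/query?query=kepler_container_joules_total' | head -c 300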
sc20tcl commented 3 months ago

@marvin-steinke Thank you so much this has fixed it.

rootfs commented 3 months ago

@marvin-steinke can you post the instructions in kepler-doc here? Thanks

marvin-steinke commented 3 months ago

@rootfs Yessir!

sc20tcl commented 3 months ago

@rootfs @marvin-steinke Okay, so by fixing my issue connecting Kepler to Prometheus (and thus Grafana), I have uncovered a new problem with Kepler's data collection.

As I previously mentioned, I am running my cluster on Azure AKS, and in this cluster I am using TeaStore as a dummy web app to run energy-consumption tests. However, for some reason Kepler is collecting some data but not all of it.

Viewing the metrics through the web UI, most of the metrics I need for TeaStore and the "default" namespace just report 0 (as you can see in the file below), yet Kepler does produce output for the node totals. Is there a fix for this that you have come across, or does Kepler simply not work on Azure AKS?

metrics-TeaStore-notworking.pdf

Thanks.

rootfs commented 3 months ago

@sc20tcl there is some activity from the TeaStore pods:

kepler_container_joules_total{container_id="6a2c668d17d9acd36f08426733ff89ae57bd0e2717a7409b8e8027258481d54f",container_name="teastoredb",container_namespace="default",mode="dynamic",pod_name="teastore-db-7b99fb9d86-rk26f",source=""} 5
kepler_container_joules_total{container_id="6a2c668d17d9acd36f08426733ff89ae57bd0e2717a7409b8e8027258481d54f",container_name="teastoredb",container_namespace="default",mode="idle",pod_name="teastore-db-7b99fb9d86-rk26f",source=""} 707
sc20tcl commented 3 months ago
Screenshot 2024-04-04 at 17 42 16

@rootfs

As you can see the pods registered power when they started up but then produced no more readings despite a number of CPU-intensive stress tests being performed.

rootfs commented 3 months ago

that is odd.

@vprashar2929 would you please check out the teastore-db test with Kepler? Thanks

sc20tcl commented 3 months ago

@rootfs @vprashar2929 Any update on finding a solution (or even the cause) to this issue?

Just for reference, following the troubleshooting advice at https://sustainable-computing.io/usage/trouble_shooting/, I have made sure to check my cgroup version:

/ # cat /sys/fs/cgroup/cgroup.controllers

cpuset cpu io memory hugetlb pids rdma misc

Besides this, I believe all AKS clusters above version 1.25 (which mine is) use cgroup v2 as standard.

vprashar2929 commented 3 months ago

@sc20tcl I tried running TeaStore on OpenShift (BM) with Kepler release-0.7.8 deployed. I can see Kepler reporting energy consumption for the TeaStore-related pods.

In case of BM:

(screenshot: bare metal)

From the logs you shared above, it looks like you are running Kepler in a VM; that is why it is using source="trained_power_model", which uses power models to estimate usage.
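
As far as I understand, on bare metal Kepler reads RAPL through the powercap sysfs interface, which is usually not exposed inside VMs, so it falls back to the trained power model there. A rough way to check on a node (assuming a shell on the node or a privileged pod):

ls /sys/class/powercap/intel-rapl* 2>/dev/null || echo "no RAPL power zones, expect the trained power model"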

@rootfs I also tried it against a VM, and I can also see Kepler not reporting values for package and dram.

In case of VM:

(screenshot: VM)

sc20tcl commented 3 months ago

@rootfs @vprashar2929 OK, considering I am using Azure AKS, this means you have replicated my findings, and there is an issue with Kepler and/or how it works in a VM.

Also I think you have used the same image twice for both BM and VM.