sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption from these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Prometheus Not Discovering Kepler Service in Kubernetes #1144

Closed bellalzohir closed 8 months ago

bellalzohir commented 10 months ago

What happened?

I have successfully deployed Kepler for energy monitoring in my Kubernetes cluster, but I am encountering an issue with Prometheus integration. While I can collect metrics locally from Kepler, these metrics are not being reported to Prometheus. I suspect the issue might be related to the service discovery configuration in Prometheus.

Deployed Kepler using the following commands:

git clone --depth 1 git@github.com:sustainable-computing-io/kepler.git
cd ./kepler
make build-manifest OPTS="PROMETHEUS_DEPLOY"
kubectl apply -f _output/generated-manifest/deployment.yaml

Deployed Prometheus using:

kubectl apply --server-side -f manifests/setup -n kepler
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl apply -f manifests/

Able to fetch metrics locally using:

kubectl exec -ti -n kepler daemonset/kepler-exporter -- bash -c "curl localhost:9102/metrics > /tmp/k.log; grep kepler_container_joules /tmp/k.log | sort -k 2 -g"
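
For anyone who wants to inspect the same dump without `grep` and `sort`, here is a minimal Python sketch of that filtering step. It assumes the exporter output uses the plain Prometheus text format with no trailing timestamps. The sample lines and their values are hypothetical; only the metric name and the `container_namespace` / `pod_name` labels come from the rules quoted in this issue.

```python
def container_joules(metrics_text, prefix="kepler_container_joules"):
    """Return (series, value) pairs for matching samples, sorted by value."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(prefix):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        try:
            samples.append((series, float(value)))
        except ValueError:
            continue  # not a parsable sample line
    return sorted(samples, key=lambda item: item[1])


# Hypothetical sample of what /tmp/k.log might contain.
sample = """\
# TYPE kepler_container_joules_total counter
kepler_container_joules_total{container_namespace="kepler",pod_name="a"} 42.5
kepler_container_joules_total{container_namespace="default",pod_name="b"} 7
"""

for series, value in container_joules(sample):
    print(f"{value:>10.1f}  {series}")
```

If this prints samples while Prometheus shows none, the exporter itself is working and the gap is between the exporter and Prometheus.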

PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/part-of: kepler
    sustainable-computing.io/app: kepler
  name: kepler-common-rules
  namespace: kepler
spec:
  groups:
  - interval: 1m
    name: kepler-common-rules
    rules:
    - expr: |
        sum by(container_namespace, pod_name) (
          increase(kepler_container_joules_total[24h:2m])
        )
      record: kepler:container_joules_total:increase:24h:by_ns_pod
    - expr: |
        sum by(container_namespace) (
          kepler:container_joules_total:increase:24h:by_ns_pod
        )
      record: kepler:container_joules_total:increase:24h:by_ns
  - interval: 30s
    name: kepler-low-res-rules
    rules:
    - expr: |
        sum by (container_namespace, pod_name) (
          irate(kepler_container_package_joules_total[2m])
        )
      record: kepler:container_package_watts:2m:by_ns_pod
    - expr: |
        sum by (container_namespace, pod_name) (
          irate(kepler_container_dram_joules_total[2m])
        )
      record: kepler:container_dram_watts:2m:by_ns_pod
    - expr: |
        sum by (container_namespace, pod_name) (
          irate(kepler_container_other_joules_total[2m])
        )
      record: kepler:container_other_watts:2m:by_ns_pod
    - expr: |
        sum by (container_namespace, pod_name) (
          irate(kepler_container_gpu_joules_total[2m])
        )
      record: kepler:container_gpu_watts:2m:by_ns_pod
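
As a rough illustration of what the `kepler-low-res-rules` group computes once samples do reach Prometheus: `irate` takes the last two samples of a counter inside the window and divides the increase by the elapsed seconds, so a joules counter becomes joules per second, i.e. watts. The sketch below mirrors that arithmetic in Python. The sample values, the pod name, and the container names are hypothetical, and this is a simplified model rather than Prometheus's own implementation.

```python
from collections import defaultdict


def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    # A drop in a counter is treated as a reset: the increase is the new value.
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


def watts_by_ns_pod(series):
    """Sum per-container rates into (namespace, pod) totals, like `sum by (...)`."""
    totals = defaultdict(float)
    for (namespace, pod, _container), samples in series.items():
        rate = irate(samples)
        if rate is not None:
            totals[(namespace, pod)] += rate
    return dict(totals)


# Hypothetical joules counters scraped every 30s for two containers of one pod.
series = {
    ("default", "web", "app"): [(0, 1000.0), (30, 1150.0), (60, 1330.0)],
    ("default", "web", "sidecar"): [(0, 200.0), (30, 260.0), (60, 320.0)],
}

print(watts_by_ns_pod(series))
```

Here the `app` container contributes (1330 - 1150) / 30 = 6 W and the sidecar (320 - 260) / 30 = 2 W, so the pod total is 8 W.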

ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kepler-exporter
    sustainable-computing.io/app: kepler
  name: kepler-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    scheme: http
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kepler
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler-exporter

Service details:

NAMESPACE     NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                        AGE
default       kubernetes                ClusterIP   10.96.0.1        <none>        443/TCP                        49d
kepler        kepler-exporter           ClusterIP   None             <none>        9102/TCP                       53m
kube-system   kube-dns                  ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP         49d
kube-system   kubelet                   ClusterIP   None             <none>        10250/TCP,10255/TCP,4194/TCP   44h
kube-system   metallb-webhook-service   ClusterIP   10.104.92.55     <none>        443/TCP                        48d
monitoring    alertmanager-main         ClusterIP   10.99.220.17     <none>        9093/TCP,8080/TCP              55m
monitoring    alertmanager-operated     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP     55m
monitoring    blackbox-exporter         ClusterIP   10.97.54.178     <none>        9115/TCP,19115/TCP             55m
monitoring    grafana                   ClusterIP   10.105.228.245   <none>        3000/TCP                       55m
monitoring    kube-state-metrics        ClusterIP   None             <none>        8443/TCP,9443/TCP              55m
monitoring    node-exporter             ClusterIP   None             <none>        9100/TCP                       55m
monitoring    prometheus-adapter        ClusterIP   10.100.62.77     <none>        443/TCP                        55m
monitoring    prometheus-k8s            ClusterIP   10.106.221.8     <none>        9090/TCP,8080/TCP              55m
monitoring    prometheus-operated       ClusterIP   None             <none>        9090/TCP                       55m
monitoring    prometheus-operator       ClusterIP   None             <none>        8443/TCP                       55m

What did you expect to happen?

Prometheus discovers the Kepler service and reports metrics.
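
A way to confirm whether that expectation is met is to ask Prometheus which targets it knows about through its `/api/v1/targets` HTTP endpoint. The sketch below is a minimal, hedged example: the base URL assumes a local port-forward such as `kubectl port-forward -n monitoring svc/prometheus-k8s 9090`, the `needle` match on "kepler" is an illustrative heuristic, and the sample response is hypothetical.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list; base_url assumes a local port-forward (assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


def summarize_kepler_targets(response, needle="kepler"):
    """List active and dropped targets whose scrape pool or job mentions `needle`."""
    data = response.get("data", {})

    def mentions(target):
        fields = (
            target.get("scrapePool", ""),
            target.get("labels", {}).get("job", ""),
            target.get("discoveredLabels", {}).get("job", ""),
        )
        return any(needle in field for field in fields)

    active = [t for t in data.get("activeTargets", []) if mentions(t)]
    dropped = [t for t in data.get("droppedTargets", []) if mentions(t)]
    return {
        "active": [(t.get("scrapeUrl"), t.get("health")) for t in active],
        "dropped": len(dropped),
    }


# Hypothetical response resembling the symptom in this issue: other exporters
# are scraped, but nothing related to Kepler appears at all.
sample_response = {
    "status": "success",
    "data": {
        "activeTargets": [{
            "scrapePool": "serviceMonitor/monitoring/node-exporter/0",
            "labels": {"job": "node-exporter", "instance": "node-1"},
            "discoveredLabels": {"job": "serviceMonitor/monitoring/node-exporter/0"},
            "scrapeUrl": "https://10.0.0.1:9100/metrics",
            "health": "up",
        }],
        "droppedTargets": [],
    },
}

print(summarize_kepler_targets(sample_response))
# Against a live cluster: print(summarize_kepler_targets(fetch_targets()))
```

If both lists are empty, Prometheus may not have generated a scrape configuration for the ServiceMonitor at all, or discovery in the `kepler` namespace may be returning nothing. Entries under "dropped" would instead suggest that targets were discovered but discarded by relabeling.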

How can we reproduce it (as minimally and precisely as possible)?

Deployed Kepler using the following commands:

git clone --depth 1 git@github.com:sustainable-computing-io/kepler.git
cd ./kepler
make build-manifest OPTS="PROMETHEUS_DEPLOY"
kubectl apply -f _output/generated-manifest/deployment.yaml

Deployed Prometheus using:

kubectl apply --server-side -f manifests/setup -n kepler
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl apply -f manifests/

Anything else we need to know?

No response

Kepler image tag

latest

Kubernetes version

```console
$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.3
```

Cloud provider or bare metal

Bare metal

OS version

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
```

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.