Closed: edoblette closed this issue 10 months ago.
could not delete bpf table elements, err: Table.Delete: key 0x3b124e: no such file or directory
This message keeps showing up. It is possible the key has already been deleted but not tracked correctly.
The metrics seem to be correctly reported via port 9102/metrics, though. I think Kepler is working fine, but I don't know why my service is dropped by Prometheus. Any ideas? @rootfs
@edoblette how did you install prometheus? Did you use kube-prometheus operator or other mechanism?
@edoblette if you use kube-prometheus, can you see kepler metrics in the prometheus query? Can you check the output of the following query?
kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'
Do you happen to have any networkpolicy in place that blocks scraping? Can you post the output of
kubectl get networkpolicy,servicemonitor -A
@edoblette how did you install prometheus? Did you use kube-prometheus operator or other mechanism?
I've used:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts > /dev/null
helm repo update > /dev/null
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace > /dev/null #--wait
kubectl expose service -n monitoring prometheus-server --type=NodePort --target-port=9090 --name=prometheus-server-ext
#kubectl get service -A
#GET PROMETHEUS URL
PROM_URL=$(kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }');
PROM_PORT=$(kubectl get -n monitoring -o jsonpath="{.spec.ports[0].nodePort}" services prometheus-server-ext);
echo "\n 🔥 Prometheus URL 🔥 \n http://${PROM_URL}:${PROM_PORT}";
@edoblette if you use kube-prometheus, can you see kepler metrics in the prometheus query? Can you check the output of the following query?
kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'
Do you happen to have any networkpolicy in place that blocks scraping? Can you post the output of
kubectl get networkpolicy,servicemonitor -A
$ kubectl get networkpolicy,servicemonitor -A
NAMESPACE NAME AGE
kepler servicemonitor.monitoring.coreos.com/kepler-prometheus-exporter 92m
The monitoring namespace is the same as the kube-prometheus operator's. Can you check whether a Prometheus query returns any Kepler metrics?
kubectl exec -ti -n monitoring prometheus-k8s-0 -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'
Hello. I am experiencing the same issue with the Helm chart, not only in kind but also on GCP. I don't know enough about Helm charts to resolve the issue, but through troubleshooting I found that the installation goes well when installing manually: build the manifest with the option OPTS="PROMETHEUS_DEPLOY" and deploy with kubectl. I do not know what the chart is doing differently from the manifests.
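For reference, the manual route is roughly the following; this is only a sketch from memory, and the exact make target and output path may differ between Kepler releases:
git clone https://github.com/sustainable-computing-io/kepler.git && cd kepler
# build the deployment manifest with the Prometheus objects included
make build-manifest OPTS="PROMETHEUS_DEPLOY"
# apply the generated manifest
kubectl apply -f _output/generated-manifest/deployment.yaml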
thank you @juangascon for the pointer!
@edoblette Can you get the service monitor yaml from your setup?
kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml
If it helps, can you use this yaml I just created with OPTS="PROMETHEUS_DEPLOY"?
kubectl apply -f https://gist.githubusercontent.com/rootfs/7ee3098af59b291964968e05536947dc/raw/1e0efa3be74e061f092c902751fc8989bf8fcde4/kepler-prometheus.yaml
Thanks, I will try soon!
When I execute kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml
I get:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kepler
    meta.helm.sh/release-namespace: kepler
  creationTimestamp: "2023-07-06T13:50:51Z"
  generation: 1
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kepler
    app.kubernetes.io/version: release-0.5.1
    helm.sh/chart: kepler-0.4.3
  name: kepler-prometheus-exporter
  namespace: kepler
  resourceVersion: "14823"
  uid: a20787e5-6819-4147-baff-6e28d60bdbe6
spec:
  endpoints:
  - interval: 1m
    path: /metrics
    port: http
    relabelings:
    - action: replace
      regex: (.*)
      replacement: $1
      sourceLabels:
      - __meta_kubernetes_pod_node_name
      targetLabel: instance
    scheme: http
    scrapeTimeout: 10s
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kepler
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler
Odd, that looks right to me. Not sure why Prometheus doesn't see the Kepler exporter.
Here is my local Prometheus service discovery page:
I'm out of the office today; I will investigate further next week with the yaml file you gave me.
For me, it sounds like your ServiceMonitor is missing a label, which prevents Prometheus from discovering it. When you install Kepler via Helm, besides enabling the service monitor, you also need to provide a label. Since I am using Terraform, the code looks like:
resource "helm_release" "kepler" {
name = "kepler"
repository = "https://sustainable-computing-io.github.io/kepler-helm-chart"
chart = "kepler"
namespace = "kepler"
create_namespace = true
set {
name = "serviceMonitor.enabled"
value = true
}
set {
name = "serviceMonitor.labels.release"
value = "kube-prometheus-stack"
}
}
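For reference, the plain Helm equivalent of the Terraform above should be roughly:
helm install kepler kepler/kepler \
  --namespace kepler \
  --create-namespace \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.labels.release=kube-prometheus-stack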
In the end, in the ServiceMonitor, you can find this label:
k get servicemonitor/kepler-prometheus-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kepler
    meta.helm.sh/release-namespace: kube-prometheus-stack
  creationTimestamp: "2023-06-30T00:11:37Z"
  generation: 1
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kepler
    app.kubernetes.io/version: release-0.5
    helm.sh/chart: kepler-0.4.2
    release: kube-prometheus-stack
  name: kepler-prometheus-exporter
  namespace: kube-prometheus-stack
I came across the same issue when using the kube-prometheus-stack helm chart. As @clin4 mentioned, in my case, it was due to a missing label. By default, the helm chart adds a service monitor selector (preventing Prometheus from matching all service monitors) unless you disable this configuration: https://github.com/prometheus-community/helm-charts/blob/f36d97ed314926a8a735a4d97f37af756ebc0bcb/charts/kube-prometheus-stack/values.yaml#L3021
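If you prefer not to label every ServiceMonitor, the alternative is to relax that selector when installing the stack. A minimal sketch, assuming the values key at the link above:
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false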
This is exactly it - I just resolved this in my setup. The helm install of kube-prometheus-stack will only add the ServiceMonitors that have a "release" label matching the helm release name ("prometheus" in my case).
To fix an already installed Kepler, edit the ServiceMonitor definition and add the label "release: <your Prometheus release name>".
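For example, something along these lines should work (the label value must match your Prometheus Helm release name):
# "prometheus" is assumed to be the kube-prometheus-stack release name here
kubectl label servicemonitor -n kepler kepler-prometheus-exporter release=prometheus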
@edoblette was the issue solved? Can we close it?
Closing for now. Reopen if an update becomes available.
Hello, I am having the same errors as @edoblette: although metrics are available on port 9102 of the Kepler pod at localhost:9102/metrics, my Prometheus cannot access them.
I am running AKS for my K8s cluster and have deployed Kepler and Prometheus as shown in this thread:
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm install kepler kepler/kepler --namespace kepler --create-namespace --values value.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus --namespace monitoring --create-namespace
I have also followed the advice of @clin4, changing my value.yaml file. The output of the service monitor command is now:
kubectl get servicemonitor -n kepler kepler-prometheus-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: kepler
    meta.helm.sh/release-namespace: kepler
  creationTimestamp: "2024-03-16T23:29:26Z"
  generation: 1
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kepler
    app.kubernetes.io/version: release-0.7.2
    helm.sh/chart: kepler-0.5.5
    release: kepler
  name: kepler-prometheus-exporter
  namespace: kepler
I made the assumption that the release and the namespace where Kepler is held should be the same, so I changed the release to be "kepler" instead of "kube-prometheus-stack". Please inform me if this is wrong, as this was a very uninformed assumption.
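(If you are running the operator-based stack, one way to check which labels your Prometheus actually selects ServiceMonitors on is a query like the one below; note this is only a sketch, and with the plain prometheus chart there is no Prometheus custom resource, so it returns nothing:)
kubectl get prometheus -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.serviceMonitorSelector}{"\n"}{end}'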
Currently the output of
kubectl exec -ti -n monitoring prometheus-server-6b56bf746f-8vzgd -- sh -c 'wget -O- "localhost:9090/api/v1/query?query=kepler_container_joules_total[200s]"'
is:
Connecting to localhost:9090 (127.0.0.1:9090)
writing to stdout
- 100% |*******************************************************************************************************************************************************************************| 63 0:00:00 ETA
written to stdout
Something else I noticed is that my Prometheus dashboard appeared different from the one shown by @rootfs. (Not sure if this is relevant, but I thought I would point it out either way.)
Any help you can provide would be greatly appreciated - this is something I have been stuck on for weeks and I just can't seem to find a solution.
@sc20tcl Having the same issues. Can you give an update if you've made any progress ^^?
@sc20tcl Ok, I got it. We need to install the kube-prometheus-stack helm chart, enable the service monitor, and add the release label of the Prometheus release:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kepler https://sustainable-computing-io.github.io/kepler-helm-chart
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--wait
helm install kepler kepler/kepler \
--namespace kepler \
--create-namespace \
--set serviceMonitor.enabled=true \
--set serviceMonitor.labels.release=prometheus
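To confirm Prometheus has picked up the target, something along these lines should work (the service name is a guess based on the default kube-prometheus-stack naming; check kubectl get svc -n monitoring first):
# forward the Prometheus UI port locally
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090 &
sleep 2
# a non-empty "result" array means Kepler metrics are being scraped
curl -s 'http://localhost:9090/api/v1/query?query=kepler_container_joules_total' | head -c 500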
@marvin-steinke Thank you so much this has fixed it.
@marvin-steinke can you post the instructions in kepler-doc here? Thanks
@rootfs Yessir!
@rootfs @marvin-steinke Okay, so by fixing my issue connecting Kepler to Prometheus (and thus Grafana), I have uncovered a new error around Kepler collecting data.
As I previously mentioned, I am running my cluster on Azure AKS, and in this cluster I am using TeaStore as a dummy web app to run tests on energy consumption. However, for some reason Kepler is collecting some data but not all.
Viewing the metrics through the web UI, most of the metrics I need surrounding TeaStore and the "default" namespace just produce 0 (as you can see in the file below), yet it does produce outputs for the node totals. Is there a fix for this that you have come across, or does Kepler simply not work on Azure AKS?
metrics-TeaStore-notworking.pdf
Thanks.
@sc20tcl there is some activity from the TeaStore pod:
kepler_container_joules_total{container_id="6a2c668d17d9acd36f08426733ff89ae57bd0e2717a7409b8e8027258481d54f",container_name="teastoredb",container_namespace="default",mode="dynamic",pod_name="teastore-db-7b99fb9d86-rk26f",source=""} 5
kepler_container_joules_total{container_id="6a2c668d17d9acd36f08426733ff89ae57bd0e2717a7409b8e8027258481d54f",container_name="teastoredb",container_namespace="default",mode="idle",pod_name="teastore-db-7b99fb9d86-rk26f",source=""} 707
@rootfs
As you can see, the pods registered power when they started up but then produced no more readings, despite a number of CPU-intensive stress tests being performed.
That is odd.
@vprashar2929 would you please check out the teastore-db test with Kepler? Thanks
@rootfs @vprashar2929 Any update on finding a solution (or even the cause) to this issue?
Just for reference, following the troubleshooting advice on https://sustainable-computing.io/usage/trouble_shooting/, I have made sure to check my cgroup version:
/ # cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
Besides this, I believe all AKS clusters above version 1.25 (which mine is) have cgroup v2 as standard.
@sc20tcl I tried running TeaStore on OpenShift (BM) with Kepler release-0.7.8 deployed. I can see Kepler reporting energy consumption for teastore-related pods.
In case of BM:
From the logs you shared above, it looks like you are running Kepler in a VM; that's why it is using source="trained_power_model", which uses power models for calculating usage.
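A quick way to see which power source and mode Kepler is attributing energy from is a Prometheus query along these lines (label names taken from the samples shown earlier in this thread):
sum by (source, mode) (kepler_container_joules_total{container_namespace="default"})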
@rootfs I also tried it against a VM, and I can also see Kepler not reporting values for package and dram.
In case of VM:
@rootfs @vprashar2929 Ok, considering I am using Azure AKS, this means you have replicated my findings and there is an error with Kepler and/or how it works in a VM.
Also, I think you have used the same image twice for both BM and VM.
What happened?
Unable to export metrics to Prometheus. Kepler appears to be "Dropped" on Prometheus's /service-discovery page. I'm reporting this problem from the previous issue "How to select the tag of kepler-helm-chart to install Kepler?" about the Kepler Helm deployment, following the initiative of @rootfs and @LAI-chuchi. However, metrics are available on port 9102 of the Kepler pod at http://:9102/metrics.
I'm using Kind for my K8s cluster, with Prometheus and Grafana already deployed. I also use Cilium without any problems.
What did you expect to happen?
Get Kepler metrics enabled in my Prometheus query dashboard.
How can we reproduce it (as minimally and precisely as possible)?
Linux 22.04, Kernel 5.5.0-050500-generic, kind version 0.18.0
Anything else we need to know?
Log from kepler pod:
The metrics from my pod (http://172.18.0.2:9102/metrics):
Kepler image tag:
Kubernetes version
OS version
Kepler deployment config
Related plugins (CNI, CSI, ...) and versions (if applicable)