Open bjornpijnacker opened 7 months ago
There are similar issues reported elsewhere. We have not been able to reproduce it yet.
For debugging, can you get the `sum(kepler_container_joules_total)` from Prometheus during the spike time? That will help us find whether this is due to the Kepler metrics or to the calculation used in the Grafana dashboard.
That gives me ~22.3 million when summing over the half hour of the spike. Another spike has happened since, with a sum of ~37.7 million. Each of the three spikes seems to last almost exactly half an hour.
Hope this helps, if you need more info do let me know!
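For anyone wanting to pull the same raw series for a spike window, here is a minimal sketch that builds a request URL for the Prometheus HTTP API range-query endpoint (`/api/v1/query_range`). The Prometheus address, the timestamps, and the helper name `build_query_range_url` are assumptions for illustration, not values from this cluster.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, query, start, end, step="30s"):
    """Build a Prometheus /api/v1/query_range URL for a PromQL query.

    start and end are Unix timestamps; step is the resolution.
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical Prometheus address and a hypothetical 30-minute window.
url = build_query_range_url(
    "http://localhost:9090",
    "sum(kepler_container_joules_total)",
    start=1700000000,
    end=1700001800,
)
print(url)
```

The resulting URL can be fetched with any HTTP client; the JSON response contains the sampled values for the window.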
Thanks @bjornpijnacker. The two potential issues are that the raw Kepler metrics themselves are wrong, or that `rate()` or `irate()` in the dashboard caused this overflow. The latter could happen if there are mismatched data types or timestamps. We have to narrow down the scenarios. For the first case, it is best to also check the Prometheus graph to see if the raw Kepler metric `sum(kepler_container_joules_total)` has any spike.
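To illustrate how a rate calculation can turn a small glitch in a counter into a large spike, here is a simplified sketch of counter-reset handling in the style of Prometheus `rate()`/`increase()`: any decrease between consecutive samples is treated as a counter reset, and the value before the drop is added back. The function name `increase_with_resets` and the sample values are hypothetical, and the real functions also extrapolate to the window boundaries, which is omitted here. Whether this mechanism is what happened in this issue is not confirmed.

```python
def increase_with_resets(samples):
    """Approximate a counter's increase over a window with reset handling.

    Simplified: a drop between consecutive samples is treated as a reset,
    so the pre-drop value is added back. Window extrapolation is omitted.
    """
    if len(samples) < 2:
        return 0.0
    total = samples[-1] - samples[0]
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # Interpreted as a counter reset: assume it restarted from zero.
            total += prev
        prev = cur
    return total


# Hypothetical joule counter samples.
steady = [100.0, 110.0, 120.0]
dipped = [1_000_000.0, 999_990.0, 1_000_020.0]  # one sample briefly goes backwards

print(increase_with_resets(steady))  # 20.0
print(increase_with_resets(dipped))  # 1000020.0
```

In the second series the counter only really advanced by 20, but the brief backwards step makes the computed increase roughly equal to the counter's whole value, which is the kind of outlier that shows up as a spike in a rate graph while the raw sum looks smooth.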
This is one of the dashboard graphs where the spikes are evident, showing the last 7 days in the default dashboard. Below are `sum(kepler_container_joules_total)` and `sum(rate(kepler_container_joules_total[1m]))` over the last 7 days, respectively.
Unfortunately, we have the same issue on our installation (bare metal). Kepler version: release-0.7.8
After we downgraded to Kepler 0.7.2, the reported values are stable again. See also https://github.com/sustainable-computing-io/kepler/issues/1279
What happened?
Kepler shows some measurements in the dashboard that seem wrong. See the screenshot. Reported usage is about 9 kW, which cannot be correct, as this cluster consists of two PCs with a 35 W power supply each.
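As a rough plausibility check on those numbers, the sketch below compares the reported power with the combined power supply rating. Treating the supply rating as an upper bound on draw is an assumption, and the helper name `plausibility_ratio` is hypothetical.

```python
def plausibility_ratio(reported_watts, node_count, watts_per_node):
    """Return how many times the reported power exceeds the combined rating."""
    budget_watts = node_count * watts_per_node
    return reported_watts / budget_watts


# Values from the report: ~9 kW shown, two PCs with a 35 W supply each.
ratio = plausibility_ratio(9000.0, 2, 35.0)
print(f"Reported power is about {ratio:.0f}x the combined 70 W rating")
```

A reading more than a hundred times the combined rating points to a measurement or calculation artifact rather than real load.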
What did you expect to happen?
The measurements to be correct.
How can we reproduce it (as minimally and precisely as possible)?
Unknown. I was not doing anything special with the cluster when this happened; in fact, I was asleep. It has happened twice so far; the other time at ~5kW a few days earlier. No logging exists from this time in Kepler.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)