AntonioDiTuri opened this issue 3 months ago
Hi @AntonioDiTuri, thanks for the bug report. I've tried to reproduce this on my kind cluster but haven't been able to. The only difference is that I'm on Linux:
Will try again on macOS and see if I can get a reproducer
@AntonioDiTuri I've now reproduced this 🎉
When deploying kepler-exporter into a VM environment where hardware perf events are not available, we fall back to using bpf_cpu_time_ms as the input to the model for estimation.
However, this appears to be accounted for incorrectly in the situation described above:
macOS via Podman Desktop (basically a VM) - no hardware perf event support:
Linux via kind - hardware perf event support
I'll start looking into this and I hope to get a fix ready soon.
I am guessing that this happens because of the default model packaged in the Kepler image. cc: @sunya-ch
@dave-tucker Do you get the same result if you use the VM docker-compose? The compose setup deploys both the estimator sidecar and the model-server, which downloads a better model.
@sthaha I don't think it's the model... I think it's the input into the model. Per the screenshots above, bpf_cpu_time_ms remains flat over time in one environment, vs the other (correct) example where it increases over time.
I can try the compose setup to prove this.
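For reference, one way to confirm whether the model input really is flat is to compare the rate of the exported BPF CPU time counter in both environments. A minimal sketch, assuming Prometheus is port-forwarded on localhost:9090 and that the counter is exported as kepler_container_bpf_cpu_time_ms_total (the exact metric name varies between Kepler versions, so treat it as an assumption):

```python
# Sketch: query Prometheus for the per-pod rate of the BPF CPU time counter.
# A zero rate for a busy pod points at the model input, not the model itself.
# The metric name and the localhost:9090 port-forward are assumptions.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = 'rate(kepler_container_bpf_cpu_time_ms_total{pod_name="fibonacci-inefficient"}[5m])'

with urllib.request.urlopen(f"{PROM_URL}?{urllib.parse.urlencode({'query': QUERY})}") as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    # A healthy environment should show a non-zero, growing rate here.
    print(series["metric"].get("pod_name", "?"), series["value"][1])
```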
@dave-tucker you are right, I totally missed the "no hardware perf event support" comment on the flat bpf time graph 🙄.
To find out if this is a regression (and whether it should block the 0.7.11 release), I have been testing this on macOS, first using 0.7.10 and then using the latest.
I found that a few containers have a rate of change of bpf time of 0.
A few are reported as not having consumed any CPU time at all (so this isn't a regression per se), while those that have consumed some show only minimal usage.
... to be continued
@sthaha I've had mixed results reproducing this. What I did see, though, is that when we have no perf events enabled we just use bpf_cpu_time_ms in the model. Looking at the eBPF probe code in isolation, I see that it reliably calculates this value and updates the maps.
Given that Prometheus scraping still occurs AND the value remains constant, it would seem that the main loop of Kepler is deadlocked somewhere - as IIRC that's what updates the Prometheus metrics every 3 seconds. I've not been able to reproduce this with debug logging enabled to confirm, but ☝️ appears to be the most likely cause.
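If the exporter loop really has stalled, that should be visible even without debug logging: scrape the exporter's /metrics endpoint twice and check whether the counter moves at all while the workload is busy. A rough sketch, assuming kepler-exporter is reachable on localhost:9102 via a port-forward and that the counter is named kepler_container_bpf_cpu_time_ms_total (both the port and the metric name are assumptions and may differ per deployment/version):

```python
# Sketch: scrape the Kepler exporter twice and diff one counter. If Prometheus
# keeps scraping but this value never changes, the loop that refreshes the
# metrics is the prime suspect. Port and metric name are assumptions.
import time
import urllib.request

METRICS_URL = "http://localhost:9102/metrics"
METRIC = "kepler_container_bpf_cpu_time_ms_total"


def sample(metric: str) -> float:
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode()
    # Sum the metric across all label sets; enough to see whether it moves.
    return sum(
        float(line.rsplit(" ", 1)[1])
        for line in text.splitlines()
        if line.startswith(metric + "{")
    )


first = sample(METRIC)
time.sleep(30)  # wait a few scrape/refresh intervals
second = sample(METRIC)
print(f"{METRIC}: {first} -> {second} ({'flat' if second == first else 'increasing'})")
```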
What happened?
I installed Kepler on a local kind cluster following the documentation, specifically using the make cluster-up command. I installed the latest version of Kepler: release-0.7.10. My hardware is a Mac with an Intel processor.
I then ran a simple Python program to check its consumption for demo purposes.
This is the simple Python code:
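A minimal sketch of such a reproducer, assuming a deliberately inefficient recursive Fibonacci that keeps one CPU core busy indefinitely (the exact script in the report may differ):

```python
# fibonacci.py - deliberately inefficient, keeps one CPU core busy forever.
import time


def fib(n: int) -> int:
    # Naive exponential-time recursion on purpose: we want sustained CPU load.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


if __name__ == "__main__":
    while True:
        start = time.time()
        result = fib(32)
        print(f"fib(32) = {result} in {time.time() - start:.2f}s", flush=True)
```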
This is the Dockerfile:
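A sketch of what the Dockerfile could look like, assuming the script above is saved as fibonacci.py (base image and file names are illustrative):

```dockerfile
# Minimal image for the CPU-bound demo script (names are illustrative).
FROM python:3.12-slim
WORKDIR /app
COPY fibonacci.py .
CMD ["python", "fibonacci.py"]
```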
And this is the simple pod.yaml I used to deploy:
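A sketch of the pod manifest, with the pod named fibonacci-inefficient as in the query below; the image name stands in for the $IMAGE_NAME placeholder used in the reproduction steps and is only illustrative:

```yaml
# pod.yaml - runs the demo image; the image name is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: fibonacci-inefficient
spec:
  restartPolicy: Always
  containers:
    - name: fibonacci-inefficient
      image: fibonacci-inefficient:latest
      imagePullPolicy: Never  # image side-loaded with `kind load docker-image`
```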
What did you expect to happen?
When querying the simple metric kepler_container_joules_total in Grafana, the value of the metric does not increase over time. It stays constant; the queries with mode=idle and mode=dynamic return different numbers, but they are still constant.
Since the container is always on and constantly uses the following resources:
I would have expected the energy value to rise, but that is not the case. Why? I am attaching a screenshot of the Grafana dashboard:
[EDIT] From the screenshot it might seem that I observed this only for a few minutes and did not give the power models enough time to "trigger a change" in the output. However, observing the process for 1 hour also made no difference; the energy stays constant.
Am I missing something?
How can we reproduce it (as minimally and precisely as possible)?
kind load docker-image $IMAGE_NAME
kepler_container_joules_total{pod_name="fibonacci-inefficient"}
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)