Open novacain1 opened 6 months ago
Hi @novacain1, just to better understand the issue:
Are you using the rate() function in the query? For example: localhost:9090/api/v1/query?query=rate(kepler_container_package_joules_total[30s])
Hi @marceloamaral thank you for your help. Answers to your questions:
With that query, I get only a marginal difference in energy usage at 1m, nothing substantial. I suspect my Prometheus instance (OpenShift Observability Add-On) uses a longer scrape interval, so a 30s window actually returns no data.
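To illustrate why a 30s window can come back empty when the scrape interval is longer, here is a minimal sketch. It is not Prometheus's implementation: `Sample` and `avgWatts` are hypothetical names, the samples are made up, and the function skips the extrapolation and counter-reset handling that the real rate() performs. It only shows that a rate over a joules counter needs at least two samples inside the window.

```go
package main

import "fmt"

// Sample is one scraped value of a cumulative joules counter (hypothetical type).
type Sample struct {
	T float64 // scrape time, in seconds
	J float64 // counter value, in joules
}

// avgWatts returns the average power over the samples that fall inside
// (end-window, end]. It needs at least two samples in the window and
// assumes the samples are sorted by time. No extrapolation, no reset handling.
func avgWatts(samples []Sample, end, window float64) (float64, bool) {
	var in []Sample
	for _, s := range samples {
		if s.T > end-window && s.T <= end {
			in = append(in, s)
		}
	}
	if len(in) < 2 {
		return 0, false // a single sample cannot yield a rate
	}
	first, last := in[0], in[len(in)-1]
	return (last.J - first.J) / (last.T - first.T), true
}

func main() {
	// Assumed data: a counter scraped every 60s that grows by 600 J per scrape.
	samples := []Sample{{0, 0}, {60, 600}, {120, 1200}, {180, 1800}}

	if _, ok := avgWatts(samples, 180, 30); !ok {
		fmt.Println("30s window: no data (only one sample in range)")
	}
	if w, ok := avgWatts(samples, 180, 120); ok {
		fmt.Printf("120s window: %.1f W\n", w)
	}
}
```

With a 60s scrape interval, a window of at least two scrape intervals is the smallest that reliably contains two samples.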
This is a question for @rootfs: is the image using the code that was refactored?
Yes, this image has the fix from #1185.
Regarding the dashboard, the green one is the efficient node while the red one is the power-hungry node. The node metrics look right (the green one is ~29% more efficient than the red one), but the pod-level metrics are not: the green one uses more power than the red one.
@marceloamaral this needs some thinking: the idle power formula uses the CPU time, which we know has some problems.
func (ne *NodeStats) UpdateIdleEnergyWithMinValue(isComponentsSystemCollectionSupported bool) {
	// gpu metric
	if config.EnabledGPU && gpu.IsGPUCollectionSupported() {
		ne.CalcIdleEnergy(config.AbsEnergyInGPU, config.IdleEnergyInGPU, config.GPUSMUtilization)
	}
	if isComponentsSystemCollectionSupported {
		ne.CalcIdleEnergy(config.AbsEnergyInCore, config.IdleEnergyInCore, config.CPUTime)
		ne.CalcIdleEnergy(config.AbsEnergyInDRAM, config.IdleEnergyInDRAM, config.CPUTime) // TODO: we should use another resource for DRAM
		ne.CalcIdleEnergy(config.AbsEnergyInUnCore, config.IdleEnergyInUnCore, config.CPUTime)
		ne.CalcIdleEnergy(config.AbsEnergyInPkg, config.IdleEnergyInPkg, config.CPUTime)
		ne.CalcIdleEnergy(config.AbsEnergyInPlatform, config.IdleEnergyInPlatform, config.CPUTime)
	}
}
What happened?
I have a workload that runs on the realtime kernel, and previously I was using the cgroup metrics in Kepler to estimate its energy consumption. With that workload back in May of this year, I was seeing around 73 W in my energy-hungry cluster and 48.6 W in my energy-efficient cluster. The lower wattage was because I was using energy-saving features such as C-states, P-states, and per-pod power management. The 73 W usage was in a cluster running wide open (no C-states, max frequency, all cores at idle=poll, which uses far more energy). Kepler was very helpful for understanding how much energy I was using at the container level.
With the latest version, I wasn't even getting eBPF metrics with the realtime kernel (https://github.com/sustainable-computing-io/kepler/issues/1175); that has since been fixed with an image from https://github.com/sustainable-computing-io/kepler/pull/1185.
Using the same workload from May with the latest 1185 image, I see unexpected idle energy consumption for the workload namespace: 12.3 W in my energy-hungry cluster and 47.6 W in my energy-efficient cluster. This does not make sense to me.
Idle power
Dynamic power
Output captured from Kepler in June 2023: ![image](https://github.com/sustainable-computing-io/kepler/assets/10089626/bb4f4ff7-5cd0-4846-822c-63984af446cd)
Output captured from Kepler in January 2024. My concern is the container metrics that appear at the bottom of this picture, captured from my Grafana instance: ![image](https://github.com/sustainable-computing-io/kepler/assets/10089626/7d491f96-eee9-45af-a71e-d103f364ebeb)
The metrics being used here are kepler_container_package_joules_total and kepler_container_dram_joules_total.
Thanks in advance for your help.
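For anyone who wants to check the two metrics outside Grafana, the sketch below builds an instant-query URL for the Prometheus HTTP API. The `buildQueryURL` helper, the `localhost:9090` address, the 2m window, and the choice to sum the package and DRAM rates are all assumptions for illustration. They are not necessarily what the dashboard panels use.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL returns an instant-query URL (/api/v1/query) for expr.
// The expression is percent-encoded so brackets and "+" survive transport.
func buildQueryURL(base, expr string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	q := url.Values{}
	q.Set("query", expr)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Assumed expression: total package + DRAM power in watts over a 2m window.
	expr := "sum(rate(kepler_container_package_joules_total[2m]))" +
		" + sum(rate(kepler_container_dram_joules_total[2m]))"

	s, err := buildQueryURL("http://localhost:9090", expr)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Since both metrics are cumulative joule counters, applying rate() yields joules per second, which is watts.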
What did you expect to happen?
Power usage roughly the same as what the older Kepler version reported back in June.
How can we reproduce it (as minimally and precisely as possible)?
I have been working with @rootfs on this, but I think one could deploy OpenShift with cgroups v1, enable the realtime kernel with the PerformanceProfile CR, and see this behavior.
Anything else we need to know?
No response