sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics
https://sustainable-computing.io
Apache License 2.0
1.11k stars 176 forks

bpf: Investigate the *best* value for wakeup_data_size #1660

Open dave-tucker opened 1 month ago

dave-tucker commented 1 month ago

What would you like to be added?

This constant: https://github.com/sustainable-computing-io/kepler/blob/main/bpf/kepler.bpf.c#L70

It sets how much data accumulates in the ringbuf before we wake up to read it, and therefore how often we wake up.

The current value was derived as follows:

  1. My system (on average) processes around 600-700 context switches per second
  2. The sample period in Kepler is once every 3 seconds
  3. We need to read at least one batch of ringbuf events within that 3 second interval

So a value of 1000 should have me read roughly every 1.4 to 1.7 seconds (1000 / 700 to 1000 / 600), which is within the 3-second sample period 😄
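The arithmetic above can be sketched in a few lines of C. This is only an illustration, not code from Kepler: the function names are hypothetical, and it assumes the threshold is counted in events, that one context switch produces one ringbuf event, and that events arrive at a steady rate.

```c
#include <assert.h>

/* Hypothetical helper: seconds between ringbuf wakeups, assuming the reader
 * is woken once `threshold_events` events have accumulated and events arrive
 * at a steady `events_per_sec` (one event per context switch). */
static double wakeup_interval_sec(double threshold_events, double events_per_sec)
{
    assert(events_per_sec > 0.0); /* a rate of zero would never trigger a wakeup */
    return threshold_events / events_per_sec;
}

/* Hypothetical helper: non-zero if at least one wakeup is expected within a
 * single sample period of `sample_period_sec` seconds. */
static int fits_sample_period(double threshold_events, double events_per_sec,
                              double sample_period_sec)
{
    return wakeup_interval_sec(threshold_events, events_per_sec) <= sample_period_sec;
}
```

With a threshold of 1000, `wakeup_interval_sec(1000, 600)` is about 1.67 s and `wakeup_interval_sec(1000, 700)` is about 1.43 s, both inside the 3-second period. On a quieter machine at 300 context switches per second the interval grows to about 3.3 s, so `fits_sample_period(1000, 300, 3)` would report that a whole sample period can pass without a read.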

Why is this needed?

When Kepler wakes up to read events, it consumes CPU. Right now that shows up as somewhere between 1% and 3% mean CPU usage over time. We should consider whether there is a better formula for computing this magic number of 1000.

It could relate to the sample rate.

e.g. 500 * SampleRate, and perhaps even the 500 could come from something better than an educated guess.
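To make the two options concrete, here is a rough sketch of both. The first function is the fixed-factor formula suggested above. The second is one possible way to replace the guessed 500 with a measured event rate. Neither is existing Kepler code: the function names, the `wakeups_per_period` parameter, and the idea of feeding in a measured rate are all assumptions made for illustration.

```c
#include <assert.h>

/* Fixed-factor formula from the suggestion above: factor * SampleRate,
 * with the sample period given in whole seconds. */
static unsigned long threshold_fixed_factor(unsigned long per_second_factor,
                                            unsigned long sample_period_sec)
{
    return per_second_factor * sample_period_sec;
}

/* Hypothetical alternative: derive the threshold from a measured event rate
 * so that roughly `wakeups_per_period` reads happen in each sample period. */
static unsigned long threshold_from_rate(double events_per_sec,
                                         double sample_period_sec,
                                         unsigned wakeups_per_period)
{
    assert(wakeups_per_period > 0); /* at least one read per period */
    return (unsigned long)(events_per_sec * sample_period_sec / wakeups_per_period);
}
```

With the 3-second sample period, `threshold_fixed_factor(500, 3)` gives 1500. At 600 events per second, `threshold_from_rate(600, 3, 1)` gives 1800, which puts the single wakeup right at the end of the period, while asking for two wakeups per period at 700 events per second gives 1050. Fewer wakeups per period mean less CPU spent reading, at the cost of less margin if the event rate drops.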

rootfs commented 1 month ago

Kepler's CPU usage under normal and stress workloads needs to be investigated in parallel. The latest stress test results point to a divergence that needs to be fixed.

rootfs commented 1 month ago

The current Kepler CPU usage is now 20% without running load. (asciicast recording)

rootfs commented 1 month ago

Test results posted on the original PR https://github.com/sustainable-computing-io/kepler/pull/1628

dave-tucker commented 1 month ago

> The current Kepler CPU usage is now 20% without running load.

How, and on what machine, can I reproduce this result?

rootfs commented 1 month ago

@dave-tucker load the latest Kepler image and keep it running for a day.

dave-tucker commented 1 month ago

> Test results posted on the original PR #1628

Responded: https://github.com/sustainable-computing-io/kepler/pull/1628#issuecomment-2269058775