sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption from those stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Reduce eBPF Probe Execution Time #1611

Open dave-tucker opened 2 months ago

dave-tucker commented 2 months ago

What would you like to be added?

Summary

#1535 adds some microbenchmarks for the eBPF probes.

The sched_switch probe is the most critical one, and it currently measures at almost 3 microseconds on my system; we'd like to get it down to the order of a few hundred nanoseconds. This matters because that code runs every time a task is scheduled on or off a CPU.

I've profiled the probe. In summary, the hot parts appear to be as follows:

| Code | Percent |
|------|---------|
| Preparing context | 46.32% |
| `bpf_map_lookup_elem(&processes, &prev_tgid)` | 4.35% |
| `bpf_perf_event_read_value` - cache miss | 2.90% |
| `bpf_get_current_pid_tgid()` | 2.79% |
| `bpf_map_update_elem(&cpu_instructions, cpu_id, &val, BPF_ANY)` | 2.76% |
| `bpf_map_lookup_elem(&processes, &curr_tgid)` | 2.62% |
| `bpf_map_delete_elem(&pid_time_map, &prev_pid)` | 2.38% |
| `bpf_map_update_elem(&cache_miss, cpu_id, &val, BPF_ANY)` | 1.98% |
| `bpf_ktime_get_ns()` | 1.82% |
| `bpf_perf_event_read_value` - cpu_instructions | 1.63% |

The time spent preparing the context is a fixed overhead, mostly due to Spectre/Meltdown mitigation code in the kernel that we do not control. Looking only at what our probe code itself contributes, map operations are the key contributor to overall probe execution time.

Proposal

Firstly we're only going to collect the following information on a sched_switch event:

  1. Timestamp - bpf_ktime_get_ns()
  2. Which CPU this event is for
  3. prev_task->tgid
  4. next_task->tgid
  5. Value of the cpu_cycles hardware counter

Secondly, that information will be sent immediately back to userland via a BPF_MAP_TYPE_RINGBUF.

Userland will constantly read from that ring buffer and will be responsible for performing the delta calculations that were previously done in the kernel.

This should bring our eBPF probe execution time down substantially.

Why is this needed?

Previously, sampling was used to reduce probe execution time. Per the discussion in #1607, with the recent correctness changes to the eBPF probes, sampling no longer yields any benefit. We do, however, need to reduce probe execution time to lessen our impact on the system.

sthaha commented 2 months ago

I like the idea of collecting only the required (raw) information through eBPF and offloading all calculations to userland. Definitely worth a try.