sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption from those stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Reduce eBPF Probe Execution Time #1611

Open dave-tucker opened 2 months ago

dave-tucker commented 2 months ago

What would you like to be added?

Summary

#1535 adds some microbenchmarks for the eBPF probes.

The sched_switch probe is the most critical one, and it currently measures at almost 3 microseconds on my system; we'd like to get it down to the order of a few hundred nanoseconds. This matters because that code runs every time a task is scheduled on or off a CPU.

I've profiled the probe. In summary, the hot parts appear to be as follows:

| Code | Percent |
|------|---------|
| Preparing context | 46.32% |
| `bpf_map_lookup_elem(&processes, &prev_tgid)` | 4.35% |
| `bpf_perf_event_read_value` - cache miss | 2.90% |
| `bpf_get_current_pid_tgid()` | 2.79% |
| `bpf_map_update_elem(&cpu_instructions, cpu_id, &val, BPF_ANY)` | 2.76% |
| `bpf_map_lookup_elem(&processes, &curr_tgid)` | 2.62% |
| `bpf_map_delete_elem(&pid_time_map, &prev_pid)` | 2.38% |
| `bpf_map_update_elem(&cache_miss, cpu_id, &val, BPF_ANY)` | 1.98% |
| `bpf_ktime_get_ns()` | 1.82% |
| `bpf_perf_event_read_value` - cpu_instructions | 1.63% |

The time spent preparing the context is a fixed overhead, mostly due to Spectre/Meltdown mitigation code in the kernel that we do not control. Looking only at what our probe code itself contributes, map operations are the key contributor to overall probe execution time.

Proposal

Firstly we're only going to collect the following information on a sched_switch event:

  1. Timestamp - bpf_ktime_get_ns()
  2. Which CPU this event is for
  3. prev_task->tgid
  4. next_task->tgid
  5. Value of the cpu_cycles hardware counter

Secondly, that information will be sent immediately back to userland via a BPF_MAP_TYPE_RINGBUF.

Userland will constantly read from that ring buffer and will be responsible for performing the delta calculations that were previously done in the kernel.

This should bring our eBPF probe execution time down substantially.

Why is this needed?

Previously, sampling was used to reduce probe execution time. Per the discussion in #1607, with the recent correctness changes to the eBPF probes, sampling no longer yields any benefit. We do, however, need to reduce probe execution time to lessen our impact on the system.

sthaha commented 2 months ago

I like the idea of collecting only the required (raw) information through eBPF and offloading all calculations to userland. Definitely worth a try.