sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

bpf sampling no longer reduces overhead #1607

Open · rootfs opened 2 weeks ago

rootfs commented 2 weeks ago

What happened?

In the latest Kepler eBPF code, EXPERIMENTAL_BPF_SAMPLE_RATE no longer makes any difference in reducing eBPF overhead. Note that in the latest code the map collection is still executed even when sampling is enabled; this was not the case in the 0.7.8 code.

I compared 0.7.8 and the latest code and calculated the per-call eBPF overhead in ns. With the 0.7.8 code, setting EXPERIMENTAL_BPF_SAMPLE_RATE to 1000 reduces the per-call overhead from 12043.3 ns to 141.279 ns. With the latest code, there is almost no difference (10187.8 ns vs 10100 ns).

What did you expect to happen?

Setting EXPERIMENTAL_BPF_SAMPLE_RATE should reduce the eBPF overhead.

How can we reproduce it (as minimally and precisely as possible)?

Run the following bpftool commands on Kepler latest and 0.7.8. With kernel.bpf_stats_enabled=1, bpftool appends the cumulative run_time_ns and run_cnt to the program line, and the awk expression divides the two to get the average per-call overhead:

```console
sysctl kernel.bpf_stats_enabled=1

# latest without sampling
echo 0 | sudo tee /etc/kepler/kepler.config/EXPERIMENTAL_BPF_SAMPLE_RATE
bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF }'

# enable sampling
echo 1000 | sudo tee /etc/kepler/kepler.config/EXPERIMENTAL_BPF_SAMPLE_RATE
bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF }'
```

Anything else we need to know?

No response

Kepler image tag

0.7.8 and latest

Kubernetes version

```console
$ kubectl version
# paste output here
```

Cloud provider or bare metal

KVM

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Kepler deployment config

For Kubernetes:

```console
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```

For standalone:

```console
# put your Kepler command argument here
```

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

rootfs commented 2 weeks ago

@sthaha @dave-tucker can you get this fixed before the 0.7.11 release? Thanks

sthaha commented 2 weeks ago

@rootfs I ran the same commands on my machine and here are the results:

```console
# no sampling
❯ sudo bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF }' | tee -a kepler-0.log
rt_ns: 572554527 count:  129051 avg:  4436.65

# sample rate: 1000
❯ sudo bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF }' | tee -a kepler-0.log
rt_ns: 1355770684 count:  481430 avg:  2816.13
```

I see that this happens because sampling is not honoured (on purpose) in https://github.com/sustainable-computing-io/kepler/blob/d3c2906c9abd0052d38e75c19e1577dead497e5c/bpf/kepler.bpf.c#L210-L213
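
For context, here is a minimal sketch of the gate shape under discussion. This is an illustration with assumed names (SAMPLE_RATE, collect_counters_and_update_maps), not the actual source at the link above:

```c
// Hypothetical reconstruction of the sampling gate (NOT the actual
// kepler.bpf.c; see the linked lines for the real code).
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// Stand-in for EXPERIMENTAL_BPF_SAMPLE_RATE, set at load time.
volatile const u32 SAMPLE_RATE = 0;
static u32 counter_sched_switch = 0; // countdown across sched_switch events

static __always_inline void collect_counters_and_update_maps(void)
{
    /* expensive part: perf-counter reads and map updates (elided) */
}

SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
    if (SAMPLE_RATE > 0) {
        if (counter_sched_switch > 0) {
            counter_sched_switch--;
            // 0.7.8 effectively returned here, skipping the work below;
            // the latest code falls through on purpose, so the expensive
            // work runs on every event and the sample rate no longer
            // reduces per-call overhead.
        } else {
            counter_sched_switch = SAMPLE_RATE;
        }
    }
    collect_counters_and_update_maps(); // executed unconditionally
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```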

@marceloamaral could you please confirm this is necessary? If this can't be optimised any further, let's close this as not a bug.

marceloamaral commented 2 weeks ago

Interesting discussion @sthaha @dave-tucker @rootfs

Regarding skipping the counter reset

The eBPF code is designed to collect hardware and software counters that accumulate for a specific process while it is using the CPU, and it collects the values at sched_switch. A sched_switch event indicates that a running task is transitioning to another state (moving to another CPU or going idle).

For example, if we skip resetting the counters for five context switches, and tasks 1, 2, 3, 4, and 5 run on CPU 1 during this period, the counters will continue accumulating values. When we finally collect the counters while task 6 is running on CPU 1, the counters will reflect the combined values of all tasks (1 through 6) that have run on CPU 1.

This accumulation results in inaccurate counter values for task 6, since they include data from all previous tasks. For accuracy and consistency, it is crucial not to skip the counter reset unless we handle it properly. Furthermore, failing to update the counters when a task leaves the CPU also distorts the per-task CPU time measurement, not only the hardware counters.
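
To make the failure mode concrete, here is a minimal sketch of the delta-and-reset bookkeeping this describes, with assumed map and helper names (not the actual implementation):

```c
// Illustrative delta-and-reset at sched_switch (assumed names, not the
// real kepler.bpf.c). Each CPU keeps the counter reading taken at its
// last switch; the delta since then is charged to the departing task,
// and the baseline is reset. Skipping this for N switches makes the
// next delta span every task that ran in between.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} cpu_baseline SEC(".maps");

static __always_inline void account_to_task(u32 pid, u64 delta)
{
    /* add delta to the departing task's totals (elided) */
}

static __always_inline void on_sched_switch(u32 prev_pid, u64 now)
{
    u32 zero = 0;
    u64 *baseline = bpf_map_lookup_elem(&cpu_baseline, &zero);

    if (!baseline)
        return;
    account_to_task(prev_pid, now - *baseline); // work since last reset
    *baseline = now;                            // the "counter reset"
}
```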

Regarding having the sampling rate

vimalk78 commented 2 weeks ago

In the current sampling approach we read the hardware counters on every event, but we only use the reading when the sampling period ends (the counter reaches 0); all the other samples are discarded, which adds to the cost.

If we only use the sample when the counter reaches zero, why not skip sampling for the whole period except when the counter is 1 (that sample will also be discarded, because its delta would be wrong)?

Trying a patch with this change and testing it now.
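
A rough sketch of that idea, reusing the hypothetical names from the gate sketch above (the actual patch may differ):

```c
// Hypothetical gate under this proposal: skip the expensive work for
// the whole period, sampling only at counter == 1 (a baseline-only
// sample whose delta is discarded) and counter == 0 (the kept sample).
SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
    if (SAMPLE_RATE > 0) {
        if (counter_sched_switch > 1) {
            counter_sched_switch--;
            return 0;                           // no reads, no map updates
        }
        if (counter_sched_switch == 1)
            counter_sched_switch--;             // fall through: baseline only
        else
            counter_sched_switch = SAMPLE_RATE; // counter == 0: kept sample
    }
    collect_counters_and_update_maps();         // ~2 calls per period
    return 0;
}
```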

sthaha commented 2 weeks ago

FYI: I have removed this from the 0.7.11 milestone. Let's address this issue in the 0.7.12 milestone.

dave-tucker commented 2 weeks ago

Per @marceloamaral's comments, I think correctness is important, and therefore sampling is no longer a viable approach to reducing the overhead we add on every sched_switch event.

I've put together an alternative proposal here: https://github.com/sustainable-computing-io/kepler/issues/1611. It details how we could move some of this logic to userland, which would achieve both objectives:

  1. Correctness
  2. Lower probe overhead
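
For illustration only (the actual design is in the linked issue): one possible shape of such a split keeps raw per-task aggregates in a pinned BPF map and moves the delta/reset bookkeeping into a periodic userspace loop. The map path, value layout, and interval below are assumptions:

```c
// Illustrative userland reader, NOT the design in the linked proposal.
// The probe only accumulates raw per-task counters in a hash map;
// userspace drains it on its own schedule, so the probe overhead stays
// constant and per-task accounting is handled off the hot path.
#include <bpf/bpf.h>
#include <stdio.h>
#include <unistd.h>

struct task_stats {                 // assumed value layout
    unsigned long long cycles;
    unsigned long long cpu_time_ns;
};

static void drain(int map_fd)
{
    unsigned int key, next_key, *prev = NULL;
    struct task_stats val;

    // Walk every pid currently in the map and read its aggregates.
    while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
        if (bpf_map_lookup_elem(map_fd, &next_key, &val) == 0)
            printf("pid %u cycles %llu\n", next_key, val.cycles);
        key = next_key;
        prev = &key;
    }
}

int main(void)
{
    // Assumed pin path for this sketch; the loader would create it.
    int map_fd = bpf_obj_get("/sys/fs/bpf/kepler/task_stats");

    if (map_fd < 0)
        return 1;
    for (;;) {                      // userspace-controlled sampling interval
        drain(map_fd);
        sleep(3);
    }
}
```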