rootfs opened this issue 2 weeks ago
@sthaha @dave-tucker can you get this fixed before the 0.7.11 release? Thanks.
@rootfs I ran the same command on my machine and here are the results:
# no sampling
❯ sudo bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } ' | tee -a kepler-0.log
rt_ns: 572554527 count: 129051 avg: 4436.65
# sample rate: 1000
❯ sudo bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } ' | tee -a kepler-0.log
rt_ns: 1355770684 count: 481430 avg: 2816.13
I see that this happens because sampling is not honoured (on purpose) in https://github.com/sustainable-computing-io/kepler/blob/d3c2906c9abd0052d38e75c19e1577dead497e5c/bpf/kepler.bpf.c#L210-L213
@marceloamaral could you please confirm this is necessary? If this can't be optimised any further, let's close this as not a bug.
Interesting discussion @sthaha @dave-tucker @rootfs
Regarding skipping the counter reset
The eBPF code is designed to collect hardware and software counters that accumulate for a specific process while it is using the CPU, and it collects the values during the sched_switch. The sched_switch indicates when a running task transitions to another state (moves to another CPU or goes idle).
For example, if we skip resetting the counters for five context switches, and tasks 1, 2, 3, 4, and 5 run on CPU 1 during this period, the counters will continue accumulating values. When we finally collect the counters while task 6 is running on CPU 1, the counters will reflect the combined values of all tasks (1 through 6) that have run on CPU 1.
This accumulation results in inaccurate counter values for task 6, since they include data from all the previous tasks. For accuracy and consistency, it is crucial not to skip the counter reset unless we handle it properly. Furthermore, failing to update the counters when a task leaves the CPU also distorts the per-task CPU time measurement, not just the hardware counters.
Regarding having the sampling rate
Maybe there is another way to implement a sampling rate. The first time we skip an iteration, we reset all the map entries, so the collection of CPU time and hardware counters has to start over. However, we need to ensure that after restarting the collection we allow enough iterations to gather meaningful metrics. That is, we must be able to calculate delta values from a previous and a current collection, neither of which was skipped.
Another possibility would be to detach the eBPF program for a few milliseconds and then reattach it. This approach would completely remove the eBPF overhead during the detached period and restart the data collection process, collecting metrics for another few milliseconds.
In the current sampling implementation we read the hardware counters on every event, but we only use the reading when the sampling period ends (the counter reaches 0) and discard all the other samples, which adds to the cost. If we only use the sample when the counter reaches zero, why not skip sampling for the whole period except when the counter is 1 (whose reading will also be discarded, because its delta would be wrong)?
I'm trying a patch with this change and testing it.
FYI: I have removed this from the 0.7.11 milestone. Let's address this issue in the 0.7.12 milestone.
Per @marceloamaral's comments I think correctness is important, and therefore sampling is no longer a viable approach to reducing the overhead we're adding on every sched_switch event.
I've put together an alternative proposal here: https://github.com/sustainable-computing-io/kepler/issues/1611, which details how we could move some of this logic to userland and achieve both objectives.
What happened?
In the latest code, setting EXPERIMENTAL_BPF_SAMPLE_RATE in the Kepler eBPF code no longer makes any difference in reducing eBPF overhead. Note that in the latest code the map collection is still executed even when sampling is enabled; this was not the case in the 0.7.8 code. I compared 0.7.8 with the latest code and calculated the eBPF per-call overhead in ns. In the 0.7.8 code, setting EXPERIMENTAL_BPF_SAMPLE_RATE to 1000 reduces the per-call overhead from 12043.3 ns to 141.279 ns, while in the latest code there is almost no difference (10187.8 ns vs 10100 ns).
![image](https://github.com/sustainable-computing-io/kepler/assets/7062400/ac2bc532-01f8-45b6-856a-5ddf289aa0e1)
What did you expect to happen?
EXPERIMENTAL_BPF_SAMPLE_RATE should be able to reduce the eBPF overhead.
How can we reproduce it (as minimally and precisely as possible)?
Run the bpftool command on Kepler latest and 0.7.8.
Anything else we need to know?
No response