sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption from these stats, and exports the estimates as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

High Kepler CPU usage under normal workloads #1670

Open vimalk78 opened 1 month ago

vimalk78 commented 1 month ago

Without any load on the system, Kepler's CPU usage goes up to 20%.

vimalk78 commented 1 month ago

https://github.com/sustainable-computing-io/kepler/issues/1660#issuecomment-2265665980

vimalk78 commented 1 month ago

On latest main, if the machine is loaded with stress-ng, Kepler's CPU usage spikes. In comparison, the Kepler build from before the ring-buffer change shows no CPU increase when the machine is loaded.

[asciicast recording]

vimalk78 commented 1 month ago

Comparing with the old code, some of the Kepler CPU usage spike is understandable: some processing (3 map lookups, 2 updates, 1 delete) used to happen in kernel context, and the CPU cycles for it were accounted to the kernel; it now happens in user space and gets counted as Kepler CPU.

Need to check if we can reduce the CPU spike in Kepler when the machine is loaded.
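One general way to keep that user-space cost down is to drain many ring-buffer events per wakeup, amortizing the fixed per-wakeup overhead across a batch. A minimal Go sketch of the idea (illustrative only, not Kepler's actual code; a buffered channel stands in for the eBPF ring buffer, and `drainBatches`/`batchSize` are hypothetical names):

```go
package main

import "fmt"

// drainBatches consumes events, opportunistically draining everything already
// queued (up to batchSize) each time it wakes, so per-event overhead is
// amortized over a batch instead of paid once per event.
func drainBatches(events <-chan int, batchSize int) (batches, processed int) {
	for {
		ev, ok := <-events
		if !ok {
			return // channel closed and fully drained
		}
		batch := []int{ev}
	drain:
		for len(batch) < batchSize {
			select {
			case ev, ok := <-events:
				if !ok {
					break drain
				}
				batch = append(batch, ev)
			default:
				break drain // nothing queued right now; process what we have
			}
		}
		batches++
		processed += len(batch)
	}
}

func main() {
	events := make(chan int, 1024)
	for i := 0; i < 1000; i++ {
		events <- i
	}
	close(events)
	batches, processed := drainBatches(events, 128)
	fmt.Printf("processed %d events in %d batches\n", processed, batches)
	// prints "processed 1000 events in 8 batches"
}
```

With 1000 queued events and a batch size of 128, the loop wakes only 8 times instead of 1000, which is the trade the ring-buffer design wants to make.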

dave-tucker commented 1 month ago

> Need to check if we can reduce the CPU spike in Kepler when the machine is loaded.

Exactly! I'm now able to reproduce this with stress-ng, and I'm working to keep that CPU spike as low as possible.

rootfs commented 1 month ago

@dave-tucker can you create a feature branch, move the code there, and revert the related commits?

vimalk78 commented 1 month ago

I ran some perf stat tests to check Kepler's impact on context-switch time. The idea: since Kepler traps sched_switch and does some processing, it should have a measurable impact on context switching. stress-ng is run in parallel to simulate load.

root@bkr18:~# sudo perf stat -a -e sched:sched_switch --timeout 600000 # with kepler latest with load

 Performance counter stats for 'system wide':

        79,620,228      sched:sched_switch                                                    

     600.099929726 seconds time elapsed

Observation: with Kepler running, the number of context switches goes down, as expected. But with the ring-buffer changes, the drop is larger than with the 7-11 release.

The test was run on a bare-metal machine with almost no other load.

stress-ng command: stress-ng --cpu 8 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 11m
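For scale, the perf stat numbers above work out to roughly 133k sched_switch events per second. A quick Go calculation of the per-event CPU budget (the 20% CPU figure is taken from the first comment and is only an assumption here, since that run was unloaded):

```go
package main

import "fmt"

func main() {
	// Numbers from the perf stat run above.
	const switches = 79_620_228
	const elapsed = 600.099929726 // seconds

	rate := switches / elapsed // sched_switch events per second
	fmt.Printf("events/sec: %.0f\n", rate)

	// Assumption for illustration: Kepler consuming 20% of one core while
	// handling events at this rate would leave this much CPU time per event.
	cpuFrac := 0.20
	perEvent := cpuFrac / rate * 1e9 // nanoseconds of CPU per event
	fmt.Printf("CPU per event: %.0f ns\n", perEvent)
}
```

At ~132,678 events/sec, a 20% CPU share gives only about 1.5 µs of processing per event, which is why moving the map lookups/updates to user space shows up so clearly under load.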