Closed by marioroy 3 months ago
The README states, "All tasks in a CPU have a shared quota = 105us in which every task runs (105us / # of tasks)". If the value is not fixed, how is the new ECHO-002 `bs_shared_quota` computed? Is it 50 base_slice_ns * 8 CPUs * 1.25 = 500?
I will try `bs_shared_quota` 4000, 5000, and 6000, since 50 base_slice_ns * 64 CPUs * 1.25 = 4000.
/sys/kernel/debug/sched/base_slice_ns 50
/proc/sys/kernel/sched_bs_shared_quota 4000
Edit: Interesting, 60fps for the WebGL blob demo, previously 56. I'm testing 6000 first.
Hello @marioroy
Sorry for the late response. The shared quota is per CPU: it is the maximum number of nanoseconds that the running tasks on one CPU have to share in one round. The smaller the value, the smoother the result, but with more context switches. The minimum value must be no less than 2x `base_slice_ns`, assuming roughly two tasks per CPU running at a time, so if base_slice_ns==500, the shared_quota minimum is 1000. On my machine the sweet spot was `bs_shared_quota=500` ns, and I just hard-coded base_slice_ns to 50 ns.
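The arithmetic in this reply can be sketched in a few lines of Python. This only illustrates the README quote (each task gets quota / number of tasks) and the 2x rule of thumb stated above; the function names are hypothetical and this is not the scheduler's actual code.

```python
def per_task_share_ns(shared_quota_ns: int, runnable_tasks: int) -> float:
    """Nanoseconds each runnable task on one CPU gets per round (quota / tasks)."""
    if runnable_tasks < 1:
        raise ValueError("need at least one runnable task")
    return shared_quota_ns / runnable_tasks


def min_shared_quota_ns(base_slice_ns: int, tasks_per_cpu: int = 2) -> int:
    """Rule-of-thumb lower bound: tasks_per_cpu * base_slice_ns (2x by default)."""
    return base_slice_ns * tasks_per_cpu


# Example values taken from the thread: base_slice_ns == 500,
# and the README's 105us quota shared by an assumed 3 runnable tasks.
print(min_shared_quota_ns(500))
print(per_task_share_ns(105_000, 3))
```

A smaller quota shrinks each task's share per round, which matches the trade-off described above: smoother scheduling at the cost of more context switches.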
Thank you for sharing the test results. If you don't mind, please share them here: https://github.com/hamadmarri/benchmarks
Could you please explain the results a bit, or just mention whether more is better or less is better?
Thank you
The shared quota is per cpu
I had wondered about `bs_shared_quota` since using ECHO. That is now clear.
Could you please explain a bit on the results, or maybe just mention which is more best or less best.
I struggled to choose between BORE v5.0.3 and ECHO v001. BORE is more responsive under CPU load, for example when launching Firefox: the window appears in less than 1 second. That is possible with ECHO (~1 second) by running the background CPU burner with the idle policy, i.e. `chrt -i 0`. For the WebGL blob demonstration, ECHO too can reach 60fps under CPU load by running Chrome with the 'fifo' policy, i.e. `chrt -f 10`.
ECHO completes the CPU burner job in less time, counting prime numbers.
Your video is where I learned about the WebGL blob demonstration. Now, there is ECHO v002. It will take some time to do various testing, including `bs_shared_quota=500`.
Thank you for the explanation, @hamadmarri.
Maybe the `sched_base_slice` or `bs_shared_quota` v002 defaults are extreme. Try running steps 1 and 2 concurrently. Launching Firefox may freeze the entire desktop momentarily.
Repeat: Quit Chrome and run normally, without fifo policy.
The freeze issue is a problem. I experienced the (1 ~ 2 seconds) freeze two more times, using `HZ_625`, and again with `HZ_800`. I like `HZ_800` for the improved interactivity. A higher `base_slice_ns` mitigates jitters. Thank you for allowing tuning.
/sys/kernel/debug/sched/base_slice_ns 3500
/proc/sys/kernel/sched_bs_shared_quota 35000
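For anyone scripting these experiments, here is a minimal Python sketch that reads and writes the two tunables shown above and reports how many base slices fit into one quota round. The two paths are the ones quoted in this thread; the helper names are hypothetical, and writing to these files on a real system is assumed to require root (and a mounted debugfs).

```python
from pathlib import Path

# Paths as quoted in this thread; they exist only on an ECHO-patched kernel.
BASE_SLICE_PATH = Path("/sys/kernel/debug/sched/base_slice_ns")
SHARED_QUOTA_PATH = Path("/proc/sys/kernel/sched_bs_shared_quota")


def read_tunable(path: Path) -> int:
    """Read a single integer value from a sysfs/procfs-style file."""
    return int(path.read_text().strip())


def write_tunable(path: Path, value: int) -> None:
    """Write an integer value to a tunable file (needs root on a real system)."""
    path.write_text(f"{value}\n")


def slices_per_round(base_path: Path = BASE_SLICE_PATH,
                     quota_path: Path = SHARED_QUOTA_PATH) -> float:
    """Ratio bs_shared_quota / base_slice_ns for the current settings."""
    return read_tunable(quota_path) / read_tunable(base_path)
```

With the values above (3500 and 35000) the ratio is 10. The path parameters make it possible to try the helpers against ordinary files before touching the real tunables.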
I reverted the following v002 change back to `RR_TIMESLICE (100 * HZ / 1000)`. Interestingly, I had no freezes before with ECHO v001.
+#ifdef CONFIG_ECHO_SCHED
+#define RR_TIMESLICE (1)
+#else
#define RR_TIMESLICE (100 * HZ / 1000)
+#endif
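To put the two `RR_TIMESLICE` definitions in perspective, here is a small Python sketch that converts them to milliseconds. It assumes `RR_TIMESLICE` is expressed in jiffies, as in the mainline kernel, and mirrors C integer division; the function name is made up for illustration.

```python
def rr_timeslice_ms(hz: int, echo_v002: bool) -> float:
    """Round-robin timeslice in milliseconds for a given CONFIG_HZ.

    echo_v002=True models '#define RR_TIMESLICE (1)' (one jiffy);
    echo_v002=False models '(100 * HZ / 1000)' with C integer division.
    """
    jiffies = 1 if echo_v002 else (100 * hz) // 1000
    return jiffies * 1000 / hz


# The two tick rates mentioned in this thread.
for hz in (625, 800):
    stock = rr_timeslice_ms(hz, echo_v002=False)
    patched = rr_timeslice_ms(hz, echo_v002=True)
    print(f"HZ={hz}: stock={stock} ms, v002={patched} ms")
```

At HZ=800 the stock definition works out to a 100 ms slice, while one jiffy is 1.25 ms. This only quantifies the size of the change; it does not by itself explain the freezes.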
Edit: That did it. No freezes, and running `HZ_800`. About `base_slice_ns`: I tried going lower, but jitters came back when launching Firefox and watching the "slowroads" demo. Another test involves lots of memory. Decreasing `bs_shared_quota` below 35000 causes "write stdout" to take 1.2 ~ 1.8 seconds, likely due to cache misses. So, `bs_shared_quota = 35000` it is for my machine.
$ ./llil4emh in/big* in/big* in/big* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort
get properties 5.910 secs
map to vector 0.879 secs
vector stable sort 1.132 secs
write stdout 0.970 secs <--- here
total time 8.892 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Interesting! I will revert the RR_TIMESLICE change and think about new default values for both base_slice_ns and bs_shared_quota. Thank you so much for the testing and debugging.
Hello @marioroy
https://github.com/hamadmarri/ECHO-CPU-Scheduler/commit/4a8cd2a52a3ef7056c3a77ba99d13ecb364454af
I have done some tests, and 35us is a better value on my machine too: https://openbenchmarking.org/result/2404031-NE-DEFAULTVS23
Thank you so much :+1:
ECHO loves `bs_shared_quota` 35000. Go much higher, no good. Go much lower, no good. That seems to be spot on. I also tried tuning the `base_slice_ns` setting (default 6000?).
/sys/kernel/debug/sched/base_slice_ns 4200
/proc/sys/kernel/sched_bs_shared_quota 35000
This looks mystical.
35000 / 4200 = 8.3(3)
35000 / 3 = 11.6(6)
Does `base_slice_ns` 4200 work well on your system? That is the lowest I can safely go without causing jitters.
Hackbench wall-clock time dropped from 40 seconds (`base_slice_ns` 6000) to 37.5 seconds (`base_slice_ns` 4200), measured under CPU load (counting prime numbers) with cyclictest running concurrently.
Hi @marioroy
https://openbenchmarking.org/result/2404043-NE-DEFAULTVS48
For CPU-bound tasks, 4200 is the best so far (see the Rust Mandelbrot test). The overall interactivity is better.
Thank you for your efforts
A Clear Linux user tried my ClearMod repository and compared the Vanilla native kernel (no preemption) and ECHO (XanMod + preemption + ECHO).
https://community.clearlinux.org/t/nvidia-and-xanmod-cl-updates/9299/32
Very cool.
I completed testing a demo for the phmap author. Yet another surprise. :-)
Hi @marioroy
Thank you for sharing the results. I am pleased to see that echo has some performance advantages :+1:
Hi, @hamadmarri
Thank you for ECHO. I captured results comparing EEVDF, BORE, and ECHO. I now realize you made an update, so I will run the tests again and report back. Testing was on a 32-core box (64 CPU threads): AMD Ryzen Threadripper 3970X, NVIDIA RTX 3070, XanMod Edge 6.8.2 kernel. I'm unsure whether `bs_shared_quota` is fixed or depends on the number of CPU threads.
ECHO tuning
I ran 4 tasks concurrently, twice (with and without idle policy for the compute job). Afterwards, I timed a kernel compile job.
Results
Observations
Blessings and grace.