hamadmarri / cacule-cpu-scheduler

The CacULE CPU scheduler is based on interactivity score mechanism. The interactivity score is inspired by the ULE scheduler (FreeBSD scheduler).

Sound interrupts during background operations #15

Closed xalt7x closed 3 years ago

xalt7x commented 4 years ago

I tried the two latest versions (5.4-r7 and 5.8-r8). Both worked well until I started a kernel compilation. 5.8-r8 is probably better, but I still get sound skips on YouTube Music (a poorly optimized web app, to be honest). With CFS the UI slows down but rarely has "xruns". With Cachy, UI responsiveness seems better, but at this point it's not suitable for "realtime" applications. It doesn't necessarily have to be compilation; sound may be interrupted during more generic user tasks such as system updates.

xalt7x commented 4 years ago

ssr-2020-10-31_15.58.41_edit.zip

Here's a video demonstration of the issue. Strangely, it even recorded badly (near the end the sound came back to normal, but SimpleScreenRecorder (or the audio monitor) received it with interruptions).

How to reproduce on KDE Plasma:
1) Downgrade some package so Discover will inform you about updates
2) Play some YT video (optionally force HD quality to increase CPU usage and make it more noticeable)
3) Click on the Discover tray icon
4) Notice sound and video interruptions

I'm using an LTS distro (Kubuntu 20.04) and a mobile CPU with Hyperthreading disabled and lots of kernel tweaks, so results may vary.

hamadmarri commented 3 years ago

Hi @Alt37

What was the last revision that didn't have this issue? I need to check which changes might have caused this problem.

What is the hrrn_max_life? 30s?

Thank you

xalt7x commented 3 years ago

I tested only the 2 latest revisions (r7 with 5.4 and r8 with 5.8). Both of them have this issue. The "hrrn_max_life" parameter had its default value (30). I tried increasing it to 60, but it didn't help. As you can see from my video, the sound is interrupted immediately when some other process stresses the CPU.

hamadmarri commented 3 years ago

Is it possible to test r6 on v5.8? The changes between r7 and r8 are big, but since both have this issue, I suspect the problem is in the small changes between r6 and r7.

xalt7x commented 3 years ago

Tried a build with "cachy-5.8-r6.patch" (minus the "sysctl_sched_nr_migrate" tweak). Nothing changes: frames and audio samples start to drop immediately after I click on KDE Discover's "Updates" system tray icon.

hamadmarri commented 3 years ago

Hi @Alt37

Maybe some kernel tweaks affect the scheduler? I believe disabling Hyperthreading increases security and reduces performance, but since it works fine with CFS, I don't think it is the issue. What kind of other tweaks are applied to the fair.c scheduler? f-sync? Or some sysctl_sched_latency changes?

What kind of hard drive is used? Sometimes a slow HD takes time to load tasks, which increases their wait time, so Cachy will run them ahead of other running tasks.

Thanks

xalt7x commented 3 years ago

What kind of other tweaks that are on fair.c scheduler

sched_latency_ns=(sysctl kernel.sched_latency_ns / 6) * 4
sched_min_granularity_ns=sched_latency_ns/8
sched_wakeup_granularity_ns * 2.5

For dual-core CPU without HT it's

sysctl kernel.sched_latency_ns=8000000
sysctl kernel.sched_min_granularity_ns=1000000
sysctl kernel.sched_wakeup_granularity_ns=5000000
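
The three formulas above reproduce these dual-core values if the untweaked kernel defaults are 12000000 ns for sched_latency_ns and 2000000 ns for sched_wakeup_granularity_ns. Those two defaults are an assumption (the thread does not state them), and the function name is made up for illustration. A minimal sketch of the arithmetic:

```python
def derive_cfs_tunables(default_latency_ns, default_wakeup_ns):
    """Apply the three scaling rules quoted above, using integer math."""
    latency_ns = default_latency_ns // 6 * 4       # (latency / 6) * 4
    min_granularity_ns = latency_ns // 8           # latency / 8
    wakeup_granularity_ns = default_wakeup_ns * 5 // 2  # wakeup * 2.5
    return {
        "kernel.sched_latency_ns": latency_ns,
        "kernel.sched_min_granularity_ns": min_granularity_ns,
        "kernel.sched_wakeup_granularity_ns": wakeup_granularity_ns,
    }


# Assumed defaults for a 2-CPU machine (not stated in the thread).
for key, value in derive_cfs_tunables(12_000_000, 2_000_000).items():
    print(f"sysctl {key}={value}")
```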

Also

sysctl kernel.sched_nr_migrate=128
sysctl kernel.sched_rt_runtime_us=800000

I tried reverting all those parameters to their default values, tried increasing "sched_latency_ns" and "sched_hrrn_max_lifetime_ms", and tried booting with "nothreadirqs" - nothing solved this problem completely. At this point I doubt that there's anything wrong on my end, because under the same conditions CFS works absolutely fine in this regard.

I saw that you're also a KDE Plasma user, but on the openSUSE distro. Did you try to reproduce it on your system?

hamadmarri commented 3 years ago

I am running YouTube right now while downloading a 567M update through Discover, without any interruption. Everything is smooth. I usually use zypper to update; neither zypper nor Discover causes any freezes on my machine.

hamadmarri commented 3 years ago

The mini-freezes on Cachy happen because some tasks have waited much longer than other running tasks. Cachy picks those waiting tasks and favors them over other tasks (to enhance responsiveness), but sometimes I/O tasks have waited a long time and there is no way to tell Cachy that those tasks are not interactive. Some tasks wait, run only once, and then new threads are created, which leaves Cachy nothing to track. I wonder what causes the long I/O delay (either HD or network, I am guessing) on your setup?
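
To illustrate the effect described here, a minimal sketch of textbook HRRN (Highest Response Ratio Next), where the ratio is (wait + service) / service. This is not Cachy's actual code, and the task names and numbers are invented for the example. It only shows why a task that waited a long time on I/O can outrank a latency-sensitive one:

```python
def response_ratio(wait_ms, service_ms):
    """Textbook HRRN ratio; grows with waiting time, shrinks with CPU time used."""
    return (wait_ms + service_ms) / max(service_ms, 1)


def pick_next(tasks):
    """Pick the task with the highest response ratio."""
    return max(tasks, key=lambda t: response_ratio(t["wait_ms"], t["service_ms"]))


# Hypothetical run queue: all names and values are illustrative only.
tasks = [
    {"name": "audio", "wait_ms": 5, "service_ms": 2},
    {"name": "packagekitd-io", "wait_ms": 400, "service_ms": 1},
    {"name": "compiler", "wait_ms": 10, "service_ms": 500},
]

print(pick_next(tasks)["name"])
```

In this toy run queue the I/O-bound task wins because its long wait dominates the ratio, even though it is not interactive.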

xalt7x commented 3 years ago

The mini freezes on Cachy is caused because there are tasks waited so long compared to other running tasks. Cachy will pick those waited tasks and favor them over other tasks.

Looks like on my machine newer tasks (Discover/packagekitd) are more "favored" than the Chromium ones. The only thing I could try is rebuilding the kernel without other patches, using a config similar to yours (if you don't mind uploading it here).

hamadmarri commented 3 years ago

Sure config-5.9.1-1-default.zip

xalt7x commented 3 years ago

Unfortunately on my system it's reproducible even with "generic" kernel configuration (HZ_250, PREEMPT_VOLUNTARY).

hamadmarri commented 3 years ago

My only guess at the cause of the issue is FAIR_GROUP. I am not using fair_group in my config; I disabled it. Can you please try with FAIR_GROUP disabled? If disabling FAIR_GROUP solves the interruptions, then I think I have a bug in Cachy with FAIR_GROUP.

Thanks

xalt7x commented 3 years ago

I'm confused...
menuconfig/nconfig allows disabling FAIR_GROUP_SCHED (General Setup > Control Group support > CPU Controller > Group scheduling for SCHED_OTHER) only after CONFIG_SCHED_AUTOGROUP (General Setup > Automatic process group scheduling) is unselected. But your config has CONFIG_SCHED_AUTOGROUP disabled and FAIR_GROUP_SCHED enabled...

hamadmarri commented 3 years ago

Selecting autogroup automatically selects fair group, but not vice versa; autogroup needs fair group to work at its best. You can disable both autogroup and fair group, enable both, or disable autogroup and enable fair group, but as far as I can see in the Kconfig files, you can't disable fair_group and keep autogroup enabled.
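
To check which of these combinations a given build actually uses, a small sketch that reads a kernel .config and reports the two options. The helper names are hypothetical; the only assumption about the file is the usual "CONFIG_X=y" / "# CONFIG_X is not set" layout:

```python
def parse_config(text):
    """Parse kernel .config text into {option: value}; unset options map to 'n'."""
    suffix = " is not set"
    opts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            opts[key] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
    return opts


def group_sched_state(opts):
    """Return (autogroup, fair_group); reject the combination Kconfig does not allow."""
    autogroup = opts.get("CONFIG_SCHED_AUTOGROUP", "n") == "y"
    fair_group = opts.get("CONFIG_FAIR_GROUP_SCHED", "n") == "y"
    if autogroup and not fair_group:
        raise ValueError("autogroup enabled without fair group scheduling")
    return autogroup, fair_group


# The combination reported earlier in this thread: autogroup off, fair group on.
sample = "# CONFIG_SCHED_AUTOGROUP is not set\nCONFIG_FAIR_GROUP_SCHED=y\n"
print(group_sched_state(parse_config(sample)))
```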

Please try with both disabled, I hope this will fix the issue so we know what caused the interrupts.

Thanks

hamadmarri commented 3 years ago

Well, I usually disable FAIR_GROUP. IDK how I missed disabling it in this build. Sorry about that.

xalt7x commented 3 years ago

So I tried with FAIR_GROUP_SCHED disabled. I guess it helps, but it doesn't eliminate the problem. Sound interruptions still happen; not immediately like before (when I launched Discover for updates), but sometimes when CPU usage increases.

hamadmarri commented 3 years ago

Could you please try this patch cacule5.9-r9.zip

Please let me know if you need a patch on specific version to try.

I really hope this solves the problem, because the mini-freezes have existed since cachy-r1; the problem disappeared on my machine but still exists on some other machines when they run the Chromium browser, or in some other cases. The root of the problem is HRRN. In the attached patch I replaced HRRN with a different policy, an interactivity score (the idea and the math are taken from the FreeBSD ULE scheduler).
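
For context, a rough sketch of how a ULE-style interactivity score behaves, modeled on the shape of FreeBSD's sched_interact_score() as I understand it. The constants and names below are assumptions for illustration, and CacULE's actual implementation may differ. Lower scores mean "more interactive":

```python
INTERACT_MAX = 100               # assumed scale, as in ULE
INTERACT_HALF = INTERACT_MAX // 2


def interactivity_score(run_time, sleep_time):
    """Score in [0, 100]: mostly-sleeping tasks score low, CPU hogs score high."""
    if run_time > sleep_time:
        div = max(1, run_time // INTERACT_HALF)
        return INTERACT_HALF + (INTERACT_HALF - sleep_time // div)
    if sleep_time > run_time:
        div = max(1, sleep_time // INTERACT_HALF)
        return run_time // div
    # run_time == sleep_time
    return INTERACT_HALF if run_time else 0


print(interactivity_score(100, 900))  # sleeps most of the time
print(interactivity_score(900, 100))  # runs most of the time
```

Unlike a pure waiting-time ratio, this score depends on the balance between running and sleeping, so a task cannot become "interactive" merely by having waited a long time once.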

Thank you

raykzhao commented 3 years ago

Hi @hamadmarri @Alt37

It seems that CacULE scheduler works best for me after enabling full preemption and removing all the scheduler tweaks from zen-kernel, xanmod, etc.

hamadmarri commented 3 years ago

Hi @raykzhao

Please let me know whether CacULE has better response time with and without heavy load (i.e. whether the mini-freezes and sound interruptions are resolved), and also how the overall performance compares to Cachy and CFS.

Thank you so much

xalt7x commented 3 years ago

Tried to rebuild Ubuntu's version "5.9.0-2" (based on 5.9.0) with the latest CacULE patch (2020-12-08). With the patch it fails to boot (hangs right after GRUB); without CacULE it loads fine. The config has nothing special compared to my previous rebuilds (tickless kernel, 500HZ, RCU_BOOST_DELAY, BFQ by default): config-5.9.0-2-cacule-2020-12-08.tar.gz

hamadmarri commented 3 years ago

Hi @Alt37

Could you please try without CONFIG_SCHED_AUTOGROUP

xalt7x commented 3 years ago

Hi @hamadmarri

Could you please try without CONFIG_SCHED_AUTOGROUP

As expected, that didn't help. Boot is still stuck at "Loading initial ramdisk" and I can't find any information in /var/log

hamadmarri commented 3 years ago

It is related to FAIR_GROUP.

I got this error in qemu with FAIR_GROUP enabled:

[    0.474757] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    0.475304] #PF: supervisor read access in kernel mode
[    0.475466] #PF: error_code(0x0000) - not-present page
[    0.475721] PGD 0 P4D 0 
[    0.475916] Oops: 0000 [#1] SMP NOPTI
[    0.475916] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.12+ #1
[    0.475916] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[    0.475916] RIP: 0010:pick_next_entity.isra.0+0xc/0x80
[    0.475916] Code: d2 49 89 c0 48 89 c8 49 f7 f0 44 01 d8 44 29 d0 c1 f8 1f 83 e0 02 83 e8 01 c3 0f 1f 40 00 41 55 41 54 49 89 f4 55 48 89 fd 53 <48> 8b 1f e8 5c 12 f9 ff 49 89 c5 48 85 db 74 27 48 8b 4b 10 48 8b
[    0.475916] RSP: 0018:ffffc9000000bd18 EFLAGS: 00000046
[    0.475916] RAX: 0000000000000000 RBX: ffff88807dc29100 RCX: ffff88807d5400f8
[    0.475916] RDX: 0000000000000000 RSI: ffff88807d530080 RDI: 0000000000000000
[    0.475916] RBP: 0000000000000000 R08: 000000001c3806f8 R09: 0000000000000001
[    0.475916] R10: 0000000000000590 R11: 00000000000001df R12: ffff88807d530080
[    0.475916] R13: ffffc9000000bd88 R14: ffff88807dc29180 R15: ffff88807dc29180
[    0.475916] FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[    0.475916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.032134] smpboot: CPU 1 Converting physical 0 to logical die 1
[    0.475916] CR2: 0000000000000000 CR3: 000000000240a000 CR4: 00000000000006f0
[    0.475916] Call Trace:
[    0.475916]  pick_next_task_fair+0xb6/0x340
[    0.475916]  __schedule+0xf5/0x6d0
[    0.475916]  schedule+0x45/0xb0
[    0.475916]  native_cpu_up+0x346/0x620
[    0.475916]  ? cpuhp_kick_ap+0xd0/0xd0
[    0.475916]  bringup_cpu+0x26/0xc0
[    0.475916]  ? cpuhp_kick_ap+0xd0/0xd0
[    0.475916]  cpuhp_invoke_callback+0x95/0x510
[    0.475916]  _cpu_up+0xa0/0x130
[    0.475916]  cpu_up+0x6f/0x90
[    0.475916]  bringup_nonboot_cpus+0x43/0x50
[    0.475916]  smp_init+0x21/0x5f
[    0.475916]  kernel_init_freeable+0xb0/0x1ce
[    0.475916]  ? rest_init+0x95/0x95
[    0.475916]  kernel_init+0x5/0xfb
[    0.475916]  ret_from_fork+0x22/0x30
[    0.475916] Modules linked in:
[    0.475916] CR2: 0000000000000000
[    0.475916] ---[ end trace 7e5cca4425e9453b ]---
[    0.475916] RIP: 0010:pick_next_entity.isra.0+0xc/0x80
[    0.475916] Code: d2 49 89 c0 48 89 c8 49 f7 f0 44 01 d8 44 29 d0 c1 f8 1f 83 e0 02 83 e8 01 c3 0f 1f 40 00 41 55 41 54 49 89 f4 55 48 89 fd 53 <48> 8b 1f e8 5c 12 f9 ff 49 89 c5 48 85 db 74 27 48 8b 4b 10 48 8b
[    0.475916] RSP: 0018:ffffc9000000bd18 EFLAGS: 00000046
[    0.475916] RAX: 0000000000000000 RBX: ffff88807dc29100 RCX: ffff88807d5400f8
[    0.475916] RDX: 0000000000000000 RSI: ffff88807d530080 RDI: 0000000000000000
[    0.475916] RBP: 0000000000000000 R08: 000000001c3806f8 R09: 0000000000000001
[    0.475916] R10: 0000000000000590 R11: 00000000000001df R12: ffff88807d530080
[    0.475916] R13: ffffc9000000bd88 R14: ffff88807dc29180 R15: ffff88807dc29180
[    0.475916] FS:  0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
[    0.475916] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.475916] CR2: 0000000000000000 CR3: 000000000240a000 CR4: 00000000000006f0
[    0.475916] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[    0.475916] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
qemu-system-x86_64: terminating on signal 2

Sorry it is my bad, I will fix it soon.
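
For anyone reading a log like the one above, a small sketch that pulls out the faulting symbol and the call chain. The helper is hypothetical (not part of any kernel tooling) and only assumes the line layout visible in this log; the sample lines are copied from it:

```python
import re

TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")
RIP_LINE = re.compile(r"RIP: \w+:(\S+)\+0x")
FRAME = re.compile(r"\s*(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(log_text):
    """Return (faulting_symbol, call_trace) from a kernel oops log."""
    rip, trace, in_trace = None, [], False
    for line in log_text.splitlines():
        body = TIMESTAMP.sub("", line)
        match = RIP_LINE.match(body)
        if match and rip is None:
            rip = match.group(1)
        if body.strip() == "Call Trace:":
            in_trace = True
            continue
        if in_trace:
            frame = FRAME.match(body)
            if frame is None:
                in_trace = False          # trace section ended
            elif not frame.group(1):      # skip unreliable "? symbol" frames
                trace.append(frame.group(2))
    return rip, trace


sample = """\
[    0.475916] RIP: 0010:pick_next_entity.isra.0+0xc/0x80
[    0.475916] Call Trace:
[    0.475916]  pick_next_task_fair+0xb6/0x340
[    0.475916]  __schedule+0xf5/0x6d0
[    0.475916]  ? cpuhp_kick_ap+0xd0/0xd0
[    0.475916] Modules linked in:
"""
print(summarize_oops(sample))
```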

hamadmarri commented 3 years ago

Please check this fix: https://github.com/hamadmarri/cacule-cpu-scheduler/blob/ce293ce0d324e1b35ae534ae3281871ee1c19cfc/patches/CacULE/v5.9/cacule5.9.patch

xalt7x commented 3 years ago

@hamadmarri Fair comparison requires identical configs. Do I need to disable CONFIG_SCHED_AUTOGROUP and FAIR_GROUP_SCHED for both builds (with & without CacULE) ?

hamadmarri commented 3 years ago

CacULE (after the above fix) works with CONFIG_SCHED_AUTOGROUP and FAIR_GROUP_SCHED. However, I prefer disabling both, since fair/auto groups are specific to CFS and exist to improve its latency. Cachy/CacULE use policies that account for user interactivity/latency by the nature of HRRN or the interactivity score. Therefore, enabling fair/auto group in CacULE will probably not provide any more responsiveness or interactivity; it will just add the overhead of processing and updating fair group data.

Based on my previous Cachy testing and examining (with this test: https://github.com/hamadmarri/os-scheduler-responsiveness-test), I didn't have any gain in responsiveness when enabling auto/fair group. So, I assume it is not needed.

For fair comparisons, I think it is good to compare the best of both, i.e. CFS with fair/auto groups and CacULE with or without them (whatever is best on your machine). Having said that, the other tuning parameters, such as latency_ns and the variables related to load balancing, should be the same.

hf29h8sh321 commented 3 years ago

CacULE has some mini-freezes on my machine.

raykzhao commented 3 years ago

@hamadmarri @hf29h8sh321

Unfortunately, it seems that on my machine the mini-freezes are more noticeable than with the Cachy scheduler under heavy load, e.g. when compiling the kernel in the background.

I think it's probably better to let the users/kernel maintainers select between Cachy/CacULE scheduler in kconfig, similar to how PDS/BMQ schedulers did, see #19.

hamadmarri commented 3 years ago

Could you please try this patch on top of commit: 1dd9bff04302d65999dbfe3fed53c08ec957525b

Please disable FAIR_GROUP and AUTOGROUP, since Cachy/CacULE don't need them.

In my tests, playing https://www.youtube.com/watch?v=LXb3EKWsInQ at 1080p60 while compiling the Linux kernel with make -j6 (on a 4-CPU machine):

CFS: many mini-freezes, and sometimes the video freezes while the audio keeps running
CacULE: some mini-freezes, but the video never pauses
CacULE with the attached patch: few mini-freezes (hard to spot), and the video never pauses
Cachy5.9: very similar to CacULE, but the video pauses a few times

Those kernels use the exact .config of the openSUSE Tumbleweed defaults, except that fair_group is disabled.

Please let me know if you got any enhancement with this patch smoother.zip

hf29h8sh321 commented 3 years ago

CacULE with the smoother patch has audio interruptions under load, more than original cachy. I found that the now deleted rdb branch from the kernel tree has the best results, with only occasional stuttering under load.

raykzhao commented 3 years ago

Hi @hamadmarri

CacULE with the smoother patch seems to be better than both the original Cachy scheduler and CacULE without the smoother patch under heavy load on my machine. Thank you for the great work!

hamadmarri commented 3 years ago

patch2.zip Could you please try this patch instead of smoother.

On top of commit: 1dd9bff04302d65999dbfe3fed53c08ec957525b

While working on a global queue, I noticed starvation. It turned out that a task's run time was reset on every migration. IDK how I didn't notice this until now.
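
A toy illustration of why such a reset matters (this is not the kernel code; the class and function names are invented): if the accounted run time is zeroed whenever a task migrates, a CPU hog keeps looking like a fresh task to any policy that ranks tasks by how much they have already run.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    run_time_ms: int = 0


def run_with_migrations(task, slices_ms, reset_on_migrate):
    """Run the task for each slice; a migration happens between slices."""
    for index, slice_ms in enumerate(slices_ms):
        if index > 0 and reset_on_migrate:
            task.run_time_ms = 0          # the described bug: history is lost
        task.run_time_ms += slice_ms
    return task.run_time_ms


slices = [40, 40, 40]                     # illustrative values only
buggy = run_with_migrations(Task("hog"), slices, reset_on_migrate=True)
fixed = run_with_migrations(Task("hog"), slices, reset_on_migrate=False)
print(buggy, fixed)
```

With the reset, the hog is only ever charged for its latest slice, so it can keep outranking tasks that have genuinely run less.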

Please let me know if it is better with this little patch.

Thank you

raykzhao commented 3 years ago

Hi @hamadmarri

The new patch seems to be as smooth as the previous smoother patch on my machine. Thank you!

raykzhao commented 3 years ago

Hi @hamadmarri

I guess the same starvation bug also exists in the original Cachy scheduler, so I just tried the original Cachy scheduler (without idle-balance) with the fix. It also seems to make the original Cachy scheduler smoother. Now both Cachy and CacULE with the fix feel similar on my machine under heavy load. Not sure which scheduler performs better during microbenchmarks.

hamadmarri commented 3 years ago

I hope the last patch fixes the freezing issues for everyone. I have updated the patch in this commit: https://github.com/hamadmarri/cacule-cpu-scheduler/commit/de32c14a813397c998368cadadf567e9b0417718

Please reopen this issue if the problem still exists.

Thank you