hamadmarri / cacule-cpu-scheduler

The CacULE CPU scheduler is based on an interactivity score mechanism. The interactivity score is inspired by the ULE scheduler (the FreeBSD scheduler).

Experiencing some random hangs under heavy workload #47

Open ltsdw opened 3 years ago

ltsdw commented 3 years ago

I've been experiencing these hangs (where everything freezes for about 5 seconds) when playing some games on Wine that usually use a lot of CPU, and sometimes when watching videos.

To make sure it was the cacule patch and nothing else, I tested with the mainline Arch kernel (no hangs). As I have some other patches applied to my kernel, I also tried compiling it without the cacule patch (also no hangs). Then I applied the cacule patch again and the hangs came back.

I'm not quite sure, but I think the commit that introduced it is 06cb3974.

I haven't tried reverting the commit to test; I only tested with these:

cacule-patch-with-hangs.txt - patch where hangs happens

cacule-without-hangs.txt - and without the hangs

But if needed I can try bisecting later to see exactly which commit causes it.

ltsdw commented 3 years ago

Also all the tunable configs are the default.

raykzhao commented 3 years ago

Hi @ltsdw

Based on https://github.com/hamadmarri/cacule-cpu-scheduler/discussions/43, have you tried reducing kernel.sched_cache_factor to a lower value, e.g. 0? Also, from my experience, you may try setting kernel.sched_cacule_yield to 0, since it may cause freezes due to some I/O issues, see https://github.com/hamadmarri/cacule-cpu-scheduler/issues/35.
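Before changing these knobs it can help to record their current values. Below is a minimal Python sketch, assuming the CacULE sysctls are exposed under /proc/sys like other kernel.* sysctls (a dotted sysctl name maps to a path with the dots replaced by slashes). The helper names `sysctl_path` and `read_sysctl` are made up for this example and are not part of the patch.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if the knob is absent."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        # Knob missing (e.g. kernel built without the patch) or unreadable.
        return None


def dump_knobs(names):
    """Print 'name = value' for each knob; value is None when missing."""
    for name in names:
        print(name, "=", read_sysctl(name))
```

Calling `dump_knobs(["kernel.sched_cache_factor", "kernel.sched_cacule_yield"])` prints both values. Changing a value needs root, for example `sysctl -w kernel.sched_cache_factor=0`.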

ltsdw commented 3 years ago

Hi there @raykzhao

Thank you for your suggestion, I'll try it.

ltsdw commented 3 years ago

Sadly it didn't work. I tried:

kernel.sched_cache_factor=0
kernel.sched_cacule_yield=0

but the hangs are still there.

hamadmarri commented 3 years ago

kernel.sched_cache_factor=0

Could you please also set kernel.sched_starve_factor=0

Is RDB enabled?

ltsdw commented 3 years ago

Could you please also set kernel.sched_starve_factor=0

The hangs are still there with kernel.sched_starve_factor=0.

Is RDB enabled?

I believe so, as I think it's enabled by default with the patch.

hamadmarri commented 3 years ago

Could you please also set kernel.sched_starve_factor=0

The hang still with kernel.sched_starve_factor=0

Is RDB enabled?

As I think it's enabled by default with the patch, I believe so.

Could you please try without RDB?

ltsdw commented 3 years ago

Could you please try without RDB?

I don't think there is a way to disable it at runtime, so it's necessary to recompile, right?

hamadmarri commented 3 years ago

Could you please try without RDB?

As I don't think there is a runtime way to disable it, it's necessary recompile it, right?

Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too?

Also provide all technical information and versions like kernel, wine, which game, and what settings.

Thanks

ltsdw commented 3 years ago

Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too?

Also provide all technical information and versions like kernel, wine, which game, and what settings.

Thanks

Sure, this one is from my last compile on 5.13.8: config.txt.

CPU: i5 5200U
GPU: Intel(R) HD Graphics 5500 (using iris)
RAM: 8 GB
Mesa: 21.3.0 (commit c0fc745b78b)
Wine: 6.13 (with some patches from tkg)
Games that I tested with: NovaRO, GTA San Andreas, Path of Exile (for this one I'll blame my GPU more than anything else). But it also happens out of nowhere when watching videos or when I'm compiling something.

When you say settings, which ones do you mean? The cacule ones? If so, they are all at the defaults.

Now let me recompile it; it will take some time.

JohnyPeaN commented 3 years ago

I have such lags in rdr2 (only), and setting kernel.sched_interactivity_factor=50 seems to help. It doesn't happen without RDB, but without RDB background load has stronger negative effects. I will test kernel.sched_starve_factor=0, too.

ltsdw commented 3 years ago

Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri.

Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better?

hamadmarri commented 3 years ago

Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri.

Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better?

Hi @ltsdw ,

Good to hear it's working fine now. However, I would really like to troubleshoot why RDB causes these freezes.

Regarding tuning, there is no specific way to test. I tried to make the defaults work fine in general, but when you have an issue you can change them. You need some background in CPU scheduling so that you can read about each cacule sysctl and change them accordingly.

I would like to keep this issue open until we see why RDB performs badly with wine.

Thank you

hamadmarri commented 3 years ago

I suspect it is related to RCU calls and softirq. I will post some fixes to try soon.

Thank you

JohnyPeaN commented 3 years ago

@hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work.
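For anyone who wants to compare context switch numbers between schedulers, here is a rough Python sketch. It assumes a Linux /proc/stat, whose `ctxt` line holds the total number of context switches since boot, and takes the difference of two readings. The function names are made up for this example, and this is a system-wide figure, not a per-process one.

```python
import time


def parse_ctxt(stat_text):
    """Extract the cumulative context switch count from /proc/stat content."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def ctxt_rate(before, after, seconds):
    """Context switches per second between two cumulative readings."""
    return (after - before) / seconds


def measure(interval=1.0, path="/proc/stat"):
    """Sample the counter twice, `interval` seconds apart, and return the rate."""
    with open(path) as f:
        first = parse_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_ctxt(f.read())
    return ctxt_rate(first, second, interval)
```

Running `measure()` while the game is in the foreground, once per scheduler, gives numbers that can be compared directly.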

hamadmarri commented 3 years ago

@hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work.

Hi @JohnyPeaN , @ltsdw

To narrow down the troubleshooting, could you please try RDB with CONFIG_HZ_PERIODIC=y to see whether it is actually related to nohz {idle, full} balancing? I remember I had nohz_balancer_kick(rq); in RDB before, but I removed it from the RDB trigger_load_balance function for reasons I have forgotten.

Also, can you try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Or try vice versa: in case you have most RCU configs disabled, try enabling them.

Based on the RDB code review I did two minutes ago, I suspect it is because of nohz balancing. I am assuming that you are using no_hz_full?

Please let me know if any of the above changes fix the freezes so I can propose a fix based on your feedback. If none of the above configs has any positive effect, then I can investigate something else.

Thank you

JohnyPeaN commented 3 years ago

@hamadmarri I needed to set PREEMPT=n too, or else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

hamadmarri commented 3 years ago

@hamadmarri I needed to set PREEMPT=n too, or else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

Hi @JohnyPeaN

I would advise you first try with CONFIG_HZ_PERIODIC=y only, and if no effects then try with the rcus.

Thank you

ltsdw commented 3 years ago

@hamadmarri

ok, I'll try too, but I'll need some time, thank you!

ltsdw commented 3 years ago

@hamadmarri

while compiling I noticed this:

kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance' [-Werror,-Wimplicit-function-declaration]
                nohz_newidle_balance(this_rq);
                ^
kernel/sched/fair.c:11324:3: note: did you mean 'nohz_run_idle_balance'?
kernel/sched/sched.h:2439:20: note: 'nohz_run_idle_balance' declared here
static inline void nohz_run_idle_balance(int cpu) { }
                   ^
1 error generated.
make[2]: *** [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: *** [scripts/Makefile.build:516: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....

and the building failed.

ltsdw commented 3 years ago

Nah, I think it was my fault, let me try again.

ltsdw commented 3 years ago

Strange: kernel/sched/fair.c does in fact have a declaration of nohz_newidle_balance at line 11050. I don't know what could be wrong here. Why isn't it seen when called at line 11324?

ltsdw commented 3 years ago

I needed to set PREEMPT=n too, or else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

Oh, now I see it.

I would advise you first try with CONFIG_HZ_PERIODIC=y only, and if no effects then try with the rcus.

@hamadmarri

But now I'm confused: should I use PREEMPT=n or not? Apparently it can't be compiled without it!

hamadmarri commented 3 years ago

Hi @ltsdw

Please try first with CONFIG_HZ_PERIODIC=y only. Keep the rest as it was.

Thank you

ltsdw commented 3 years ago

@hamadmarri

But now there is a compile error: kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance'

raykzhao commented 3 years ago

Hi @ltsdw @hamadmarri

I think the compile error occurs because nohz_newidle_balance is not defined when CONFIG_NO_HZ_COMMON=n and CONFIG_CACULE_RDB=y. Please try the following fix:

--- a/kernel/sched/fair.c   2021-08-18 22:39:26.513174343 +1000
+++ b/kernel/sched/fair.c   2021-08-18 22:38:19.322803092 +1000
@@ -11084,9 +11084,9 @@
 {
    return false;
 }
+#endif

 static inline void nohz_newidle_balance(struct rq *this_rq) { }
-#endif

 #endif /* CONFIG_NO_HZ_COMMON */

fix.patch.zip

ltsdw commented 3 years ago

@hamadmarri @raykzhao

Ok, I tested with CONFIG_HZ_PERIODIC=y and, at least for me, the hangs are still there. Now I'll try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Just a question, should I still use the CONFIG_HZ_PERIODIC=y or not?

JohnyPeaN commented 3 years ago

@hamadmarri CONFIG_HZ_PERIODIC=y removes the random lags and the game is smooth even with RDB. I also tried the other suggested config options, but nothing noticeable happened.

ltsdw commented 3 years ago

@hamadmarri @JohnyPeaN

Just tested with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

and that also didn't work; the hangs are still happening for me.

raykzhao commented 3 years ago

Hi @ltsdw

Another thing I would suspect is autogroup. Have you tried disabling it? You may try adding noautogroup to your kernel boot command-line parameters.
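To check what state autogroup is in on a running system, something like the following Python sketch may help. It assumes the standard /proc/sys/kernel/sched_autogroup_enabled file (0 or 1, only present when autogroup is built in) and /proc/cmdline. The function names and the returned strings are made up for this example.

```python
from pathlib import Path


def autogroup_status(sysctl_text, cmdline_text):
    """Summarise autogroup state.

    sysctl_text: content of /proc/sys/kernel/sched_autogroup_enabled,
                 or None if the file does not exist.
    cmdline_text: content of /proc/cmdline.
    """
    if sysctl_text is None:
        # File missing: autogroup is most likely not built into this kernel.
        return "unavailable"
    if sysctl_text.strip() == "1":
        return "enabled"
    booted_off = "noautogroup" in cmdline_text.split()
    return "disabled at boot" if booted_off else "disabled at runtime"


def read_or_none(path):
    """Read a file, returning None when it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None
```

For example, `autogroup_status(read_or_none("/proc/sys/kernel/sched_autogroup_enabled"), read_or_none("/proc/cmdline") or "")`. Writing 0 to the sysctl file as root also turns autogroup off at runtime, though whether that has the same effect on these hangs as booting with noautogroup has not been tested here.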

ltsdw commented 3 years ago

hi @raykzhao

thank you for your suggestion, I'll try it out later as I cannot right now

JohnyPeaN commented 3 years ago

@raykzhao I have autogroup enabled, but before CONFIG_HZ_PERIODIC=y even disabling it didn't help. I was using no_hz_full, but lately without the kernel command-line parameter, which, if I understand correctly, results in no_hz_idle.
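Since the tick mode keeps coming up, here is a small Python sketch that works out which mode a kernel is effectively using from its .config text and boot command line. It assumes the reading above is right, namely that CONFIG_NO_HZ_FULL=y without a nohz_full= boot parameter behaves like no_hz_idle. The function name and return strings are made up for this example.

```python
def tick_mode(config_text, cmdline_text=""):
    """Guess the effective tick mode from kernel .config text and boot cmdline."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled options look like CONFIG_FOO=y; disabled ones are comments.
        if line.endswith("=y") and not line.startswith("#"):
            enabled.add(line[:-2])

    if "CONFIG_HZ_PERIODIC" in enabled:
        return "periodic"
    if "CONFIG_NO_HZ_FULL" in enabled:
        has_param = any(arg.startswith("nohz_full=") for arg in cmdline_text.split())
        if has_param:
            return "nohz_full"
        return "nohz_idle (NO_HZ_FULL built, no nohz_full= parameter)"
    if "CONFIG_NO_HZ_IDLE" in enabled:
        return "nohz_idle"
    return "unknown"
```

The config text can come from the attached .config or, on kernels that expose it, from /proc/config.gz; the command line is in /proc/cmdline.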

ltsdw commented 3 years ago

@hamadmarri @raykzhao

So I just tested with noautogroup and the hangs are gone again.

To summarize, for me, neither:

CONFIG_HZ_PERIODIC=y

nor:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

worked so far, but disabling autogroup did the trick. Thank you!

hamadmarri commented 3 years ago

Could you please try this fix

rdb-nohz-fix.zip

Please test with either no_hz_idle or no_hz_full

Thank you

ltsdw commented 3 years ago

@hamadmarri

Just tested the fix, compiled with no_hz_full, and the hangs persist.

ltsdw commented 3 years ago

Also, I don't know if it's relevant, but I noticed that when the hangs (freezes) happen, the CPU usage drops from around 60-70% to 5-10%. It stays there for 5-10 seconds while everything hangs and then returns to the usage from before the hang (around 70%).
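To capture such a drop with timestamps instead of eyeballing a monitor, a Python sketch along these lines could be used. It assumes the usual /proc/stat layout, where the first line is `cpu user nice system idle iowait irq softirq steal guest guest_nice`, and counts idle plus iowait as not busy. The function names are made up for this example.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from the first line of /proc/stat."""
    fields = stat_text.splitlines()[0].split()
    if fields[0] != "cpu":
        raise ValueError("unexpected /proc/stat format")
    return [int(x) for x in fields[1:]]


def busy_percent(prev, curr):
    """Percentage of non-idle time between two counter snapshots."""
    deltas = [c - p for p, c in zip(prev, curr)]
    # Only user..steal: guest time is already included in user/nice.
    total = sum(deltas[:8])
    idle = deltas[3] + deltas[4]  # idle + iowait
    return 100.0 * (total - idle) / total if total else 0.0


def log_usage(samples=60, interval=1.0, path="/proc/stat"):
    """Print one timestamped usage line per interval."""
    with open(path) as f:
        prev = parse_cpu_line(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            curr = parse_cpu_line(f.read())
        print(time.strftime("%H:%M:%S"), f"{busy_percent(prev, curr):.1f}%")
        prev = curr
```

Leaving `log_usage()` running in a terminal during a game session should show the drop next to the time it happened, which can then be matched against dmesg or perf output.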

JohnyPeaN commented 3 years ago

@hamadmarri can confirm. Lags are still happening, although they seem much shorter, but still noticeable. With periodic ticks, it's completely fluid for me.

MoisesMH commented 3 years ago

Hi there. I was wondering whether having CONFIG_NO_HZ=y enabled is really necessary to activate CONFIG_NO_HZ_IDLE=y. I've read that the first one is really meant for older kernels, but since I'm running 5.13, I guess it's not necessary at all. Also, I've read that CONFIG_NO_HZ in recent kernels has been split into CONFIG_NO_HZ_IDLE, CONFIG_HZ_PERIODIC and CONFIG_NO_HZ_FULL. I've thought about running CONFIG_HZ_PERIODIC=y, but it would waste energy on the CPU even when it's idle. These are the sources I've read:

https://www.linuxquestions.org/questions/linux-kernel-70/timer-tick-handling-4175468487/
https://github.com/torvalds/linux/blob/master/kernel/time/Kconfig

Also, I've read that reducing the timer frequency could improve the performance of a kernel. I'm currently running with CONFIG_HZ_1000=y, but I can try RDB with a lower value (CONFIG_HZ_500=y) to see if I notice an improvement.

On the other hand, thanks for your scheduler. It runs incredibly smoother than cfs. Also, the CPU usage is reduced by a lot and the framerates are solid in resource-hungry games. I'm here because I've experienced the same issue: hangs in Star Wars Battlefront II every 5 seconds on average since I activated the RDB feature. I'll stay tuned for improvements, since the RDB feature is really interesting. Also, sorry if my question is something obvious, but could you explain to me how the rdb interval works, and what the difference is between running at a lower and a higher interval? I'd appreciate that.

hamadmarri commented 3 years ago

Also, sorry if my question is something obvious, but could you explain me how the rdb interval works, and what's the difference between running at a lower and a higher interval. I'd appreciate that.

Hi @MoisesMH

The interval is a number in milliseconds; each CPU runs the load balancer once per interval:
0: load balancing runs every tick
4: load balancing runs every 4 ms
and so on.

A low value balances more often, at the cost of more runqueue locking. A high value balances less often, but it reduces the time spent locking runqueues.

Thanks
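To make the interval description above a bit more concrete, here is a toy Python model. It is not the RDB code; it only assumes that the balancer can run at most once per timer tick and only when at least the interval has passed since its last run. `balance_runs` and its parameters are made up for this example.

```python
def balance_runs(interval_ms, hz=1000, duration_ms=1000):
    """Count how often one CPU would balance in `duration_ms` (toy model).

    interval_ms: the balancing interval in milliseconds (0 = every tick)
    hz:          timer ticks per second (the kernel's CONFIG_HZ value)
    """
    n_ticks = duration_ms * hz // 1000
    interval_us = interval_ms * 1000
    runs = 0
    next_due_us = 0
    for i in range(n_ticks):
        now_us = i * 1_000_000 // hz  # time of tick i in microseconds
        if now_us >= next_due_us:
            runs += 1
            next_due_us = now_us + interval_us
    return runs


if __name__ == "__main__":
    for hz in (1000, 250):
        for interval in (0, 4, 8):
            print(f"HZ={hz} interval={interval}ms ->",
                  balance_runs(interval, hz), "runs per second")
```

In this model an interval below the tick length makes no difference: at HZ=250 a tick is already 4 ms, so 0 and 4 give the same number of runs, while at HZ=1000 going from 0 to 4 cuts the runs (and the runqueue locking that comes with them) to a quarter.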

hamadmarri commented 3 years ago

Hi @all

Since the current version of RDB is broken, I will disable it by default. Until a fix is found, you can still use the older RDB versions, which have no autogroup support.

Thank you

ptr1337 commented 3 years ago

@hamadmarri

You can just create a separate RDB.patch like you did earlier, or move the current patch into experimental.

After all, I have faced no issues with the current RDB with autogroup.

SoongVilda commented 3 years ago

My experience with linux-cacule-rdb-autogroup:

Firefox, Telegram, Steam and playing Xonotic: no issues, stable and high fps.

raykzhao commented 3 years ago

Hi @hamadmarri,

Since the majority of the issues reported here happen during wine/gaming, maybe it is a good idea to look at the locking. I suspect there may be some issues in the latest rdb/autogroup with futex2. Also, some game developers are known to use locking mechanisms in ways they are not supposed to be used.

ptr1337 commented 3 years ago

Even when running games with futex2 I don't face any issues.

I'm going to test again, but I'm sure there is no problem.

hamadmarri commented 3 years ago

Hi @hamadmarri,

Since majority of the issues reported here happen during wine/gaming, maybe it is a good idea to look at the locking. I suspect maybe there are some issues in latest rdb/autogroup with futex2. Also some game developers are known to use locking mechanisms in the way that it is not supposed to be used.

Hi @raykzhao

I am not sure, actually, because most of the feedback is not strongly related. None of the fixes worked for @ltsdw, but for @JohnyPeaN changing to periodic hz worked. Also, I have tested my proposed fix and it reduces the performance to be worse than the CFS balancer. I guess the best way is to make RDB work with periodic hz and without {auto, fair}_group. The locking issues in games could be a reason, but why is it ok with the CFS balancer and bad with RDB? I thought it was because the CFS balancer goes through softirq, but even the fix where I made the RDB balancer use softirq didn't fix the freezes. I am afraid it is due to something else that RDB didn't take care of.

If you don't mind, @all, could you please attach the CPU topology from lstopo? It could be related to shared core balancing, or to the number of CPUs, where heavy locking becomes an issue.

Thank you

ltsdw commented 3 years ago

@hamadmarri

I don't know whether you wanted an image or something else, but here it is:

Screenshot-20-08-2021_09-45-25

Thank you for your support!

raykzhao commented 3 years ago

Hi @hamadmarri,

This is my laptop: Screenshot_2021-08-21_01-04-06

JohnyPeaN commented 3 years ago

@hamadmarri this is the machine on which I'm testing:

Screenshot_lstopo_2

MoisesMH commented 3 years ago

Hey @hamadmarri

This is the machine I'm testing:
CPU: AMD Ryzen 5 3600 6-core processor
RAM: 2x8GB DDR4 2666
Storage: 256GB NVMe M.2 SSD, 2TB HDD
GPU: RX 5500 XT, 4GB GDDR6 VRAM

lstopo

JohnyPeaN commented 3 years ago

@hamadmarri I have made a discovery. The lagging is caused by the compositor, not by the game engine (I noticed that mangohud was showing 60fps constantly). So if I disable the plasma compositor, the game is fluid even with RDB. With the compositor enabled:

cacule = no lags
cacule + rdb = heavy lags
cacule + rdb + fix = very short, but frequent and noticeable lags
cacule + rdb + periodic = no lags

So it seems that the compositor gets neglected under certain circumstances, and although the game renders its images, they are not shown.

Here is the top of a perf session:

41.15%  swapper          [kernel.vmlinux]                      [k] acpi_idle_enter
10.13%  swapper          [kernel.vmlinux]                      [k] acpi_processor_ffh_cstate_enter
 1.42%  RDR2.exe         ntdll.so                              [.] __fsync_wait_objects
 1.03%  RDR2.exe         ntdll.so                              [.] __wine_syscall_dispatcher
 1.02%  RDR2.exe         [kernel.vmlinux]                      [k] native_sched_clock