Experiencing some random hangs under heavy workload

hamadmarri / cacule-cpu-scheduler

The CacULE CPU scheduler is based on interactivity score mechanism. The interactivity score is inspired by the ULE scheduler (FreeBSD scheduler).


Experiencing some random hangs under heavy workload #47

Open ltsdw opened 3 years ago

ltsdw commented 3 years ago

I've been experiencing these hangs (where everything freezes for about 5 seconds) when playing some games on Wine that usually use a lot of CPU, and sometimes when watching videos.

To be sure it was the CacULE patch and nothing else, I tested with the mainline Arch kernel (no hangs). Since I have some other patches applied to my kernel, I also tried compiling it without the CacULE patch (also no hangs). Then I applied the CacULE patch again and the hangs came back.

I'm not quite sure, but I think the commit that introduced it is 06cb3974.

I haven't tried reverting the commit to test; I only tested with these:

cacule-patch-with-hangs.txt - patch where the hangs happen

cacule-without-hangs.txt - and without the hangs

But if needed I can try bisecting later to see exactly which commit causes it.

hamadmarri commented 3 years ago

> @hamadmarri I have made a discovery. The lagging is caused by compositor, not inside the game engine (I noticed that mangohud was showing 60fps constantly). So If I disable the plasma compositor, the game is fluid even with RDB. With compositor enabled:
>
> cacule = no lags
> cacule + rdb = heavy lags
> cacule + rdb + fix = very short, but frequent and noticeable lags
> cacule + rdb + periodic = no lags
>
> So it seems that the compositor gets neglected under certain circumstances and although game renders its images, they are not shown.
>
> Here is top of perf session:
> 41.15%  swapper          [kernel.vmlinux]                      [k] acpi_idle_enter
> 10.13%  swapper          [kernel.vmlinux]                      [k] acpi_processor_ffh_cstate_enter
>  1.42%  RDR2.exe         ntdll.so                              [.] __fsync_wait_objects
>  1.03%  RDR2.exe         ntdll.so                              [.] __wine_syscall_dispatcher
>  1.02%  RDR2.exe         [kernel.vmlinux]                      [k] native_sched_clock

Hi @JohnyPeaN

I think it is related to the tick update, where RDB-r3 needs to update the highest-IS task on every tick. The previous RDB version used a slightly different approach, since the enqueue was sorted.

Do the lags happen on the previous RDB version (the one without sched_group support)?

Thank you for the observation :+1:

hamadmarri commented 3 years ago

> Hey @hamadmarri
>
> This is the machine I'm testing:
> AMD Ryzen 5 3600 6-core processor
> 2x8GB DDR4 2666 RAM
> 256GB NVMe M.2 SSD
> 2TB HDD Drive
> 4GB GDDR6 VRAM RX 5500 XT

Hi @MoisesMH

Just to double-check, could you please try with CONFIG_HZ_PERIODIC=y and without the fix patch? I recommend using make menuconfig to enable CONFIG_HZ_PERIODIC, since it sets the corresponding configs automatically and you don't need to worry about the other CONFIG_NO_HZ_* settings.

What I am thinking is that you and @JohnyPeaN have many CPUs, so there is a high probability that some of them go idle and the no_hz wakeup doesn't work with RDB. Also, I am afraid that @ltsdw needs to retry with CONFIG_HZ_PERIODIC=y, make sure there are no compilation errors, and check that CONFIG_HZ_PERIODIC=y is really enabled after installation.
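For anyone who wants to verify the tick mode of an installed kernel, here is a minimal sketch that lists which tick options are set to `y` in a kernel config. The option names are the real Kconfig symbols; the two file locations in the `__main__` part are assumptions and depend on the distribution (`/proc/config.gz` only exists when the kernel was built with CONFIG_IKCONFIG_PROC).

```python
import re

# Kconfig symbols that select the tick mode.
TICK_OPTIONS = ("CONFIG_HZ_PERIODIC", "CONFIG_NO_HZ_IDLE", "CONFIG_NO_HZ_FULL")


def enabled_tick_options(config_text):
    """Return the tick-mode options that are set to 'y' in a .config text."""
    enabled = []
    for line in config_text.splitlines():
        match = re.fullmatch(r"(CONFIG_\w+)=y", line.strip())
        if match and match.group(1) in TICK_OPTIONS:
            enabled.append(match.group(1))
    return enabled


if __name__ == "__main__":
    import gzip
    import os
    import platform

    # Assumed locations; which one exists depends on the distribution.
    candidates = ["/proc/config.gz", "/boot/config-" + platform.release()]
    for path in candidates:
        if os.path.exists(path):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as handle:
                print(path, enabled_tick_options(handle.read()))
            break
    else:
        print("no kernel config found in", candidates)
```

If the list contains CONFIG_NO_HZ_IDLE instead of CONFIG_HZ_PERIODIC, the running kernel is not the one that was meant to be tested.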

Another suspicion is that the RDB-r3 balancer tries to pick from all tasks in the rq, some of which have an RT policy! In contrast, the previous RDB version only used rq->cfs tasks to balance. So it could be that the plasma compositor runs under an RT policy (not sure), and if that is the case, then RDB keeps balancing RT tasks (since it moves one task at a time) and cfs tasks are not balanced at all during the freezes. Could @JohnyPeaN please check which policy the plasma compositor has?

I am 100% sure that RDB does not take the nohz kick for waking up idle CPUs into account, and if setting the periodic tick works for all of you, then we know that the problem is the nohz wakeup kicker. However, if @ltsdw still has freezes while using the periodic tick, then we might have another issue as well.

During testing, please make sure that:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

so that we can rule out the cache and starve scores as the cause of the freezes.
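A small sketch for confirming that these two values are actually in effect before a test run. It only reads files under `/proc/sys`; the two sysctl names are the ones from this thread and exist only on a CacULE-patched kernel, so on any other kernel the report says "missing". The helper names are made up for illustration.

```python
from pathlib import Path

# Values requested for the test runs in this thread.
WANTED = {"kernel.sched_cache_factor": 0, "kernel.sched_starve_factor": 0}


def sysctl_path(name, root="/proc/sys"):
    """Translate a dotted sysctl name into its file path under root."""
    return Path(root) / name.replace(".", "/")


def check_sysctls(wanted, root="/proc/sys"):
    """Map each sysctl name to 'ok', 'missing', or the unexpected value."""
    report = {}
    for name, expected in wanted.items():
        path = sysctl_path(name, root)
        if not path.exists():
            report[name] = "missing"
            continue
        current = int(path.read_text().strip())
        report[name] = "ok" if current == expected else current
    return report


if __name__ == "__main__":
    for name, status in check_sysctls(WANTED).items():
        print(name, "->", status)
```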

Thank you

MoisesMH commented 3 years ago

@ltsdw mentioned he used rdb without autogroup and it gave him no spikes.

At the moment, I've compiled the kernel I've been using without the patch and with these parameters:

RDB Interval: 19 (default)
CONFIG_HZ_1000=y
CONFIG_SCHED_AUTOGROUP=n
CONFIG_NO_HZ=y (I've read it's used for old configs, so I kept it enabled)
CONFIG_NO_HZ_IDLE=y (Tickless idle)

Also I've tweaked some options for the kernel configuration. I'll post it here just in case you want to take a look:

https://drive.google.com/file/d/1eR6NIPe88lc1SCz_nqGjNXPPRPOSavv0/view?usp=sharing

It's weird for me, because 15 minutes ago I was testing with about 30 minutes of gameplay in Star Wars Battlefront II. I was using the latest MangoHud version from the AUR (not the mangohud-git one). For approximately the first 15 minutes I experienced no spikes at all and the framerate was constant and smooth. After that, I encountered some small ones, roughly every 5 minutes, which lasted 2 seconds each. Then the spikes seemed to be gone, until my game froze for 5 seconds, just like when I had autogroup enabled. After the freeze, audio and video were out of sync for a second and then everything went back to normal. So it seems to be related to heavy workload, as the title of this issue suggests. My CPU usage was about 54 to 59% during gameplay and the GPU was at 99%, which is expected since the graphics card is rendering the shaders and everything else. I guess I was using RDB-r2, because it's included in the linux-tkg kernel provided by @TkGlitch. I put the links below:

Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"):
https://github.com/Frogging-Family/linux-tkg/blob/master/customization.cfg
https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/patches/CacULE/RDB/rdb.patch#L56

So that is the cacule-rdb version I'm using. Should I test RDB-r3, or is RDB-r2 fine? I'm not sure exactly which version that commit belongs to, but I can try compiling it manually. I don't know if the AUR package linux-cacule-rdb has these problems too, but I'll first try the one included in the linux-tkg kernel. I prefer it because it has more patches that can increase performance and improve CPU efficiency. However, I'm starting to think one of those patches could be causing the problem too.

On the other hand, the theories you mention are possible. I haven't tested without the compositor, and I don't know how to deactivate it. I'll look that up and test without it too. Currently I'm using OpenGL 2.0; OpenGL 3.1 is also available. I've read that many people suggest Compton as a replacement, so I could test it too. That's my progress so far. I'll keep testing and report whether CONFIG_HZ_PERIODIC=y and the parameters kernel.sched_cache_factor = 0 and kernel.sched_starve_factor = 0 make any difference. Thanks for the reply!

ltsdw commented 3 years ago

hi @hamadmarri

Just recompiled here, with CONFIG_HZ_PERIODIC=y and tested with:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

but no difference, I'm still experiencing the hangs.

The compositor was also mentioned here. I don't know if disabling the compositor worked for you @JohnyPeaN, but I tried disabling it here and it didn't make any difference (though I'm using picom, not plasma).

So far what worked was disabling RDB altogether or using noautogroup.

JohnyPeaN commented 3 years ago

@hamadmarri I'm not sure which process is responsible for compositing, but I think it's kwin_x11. Anyway, it has normal priority (0), like the rest of the desktop. I will try changing its priority to see if it has an effect.
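Note that the priority shown by most tools is not the same thing as the scheduling policy that was asked about. A minimal sketch for looking up the policy of a process by name, using Python's `os.sched_getscheduler` (Linux only). The process name `kwin_x11` is taken from the guess above; the helper names are hypothetical.

```python
import os

# Policies exposed by Python's os module; anything else is reported as unknown.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def policy_name(pid):
    """Return the scheduling policy name of the given process."""
    policy = os.sched_getscheduler(pid)
    # The reset-on-fork flag can be OR-ed into the returned value.
    policy &= ~os.SCHED_RESET_ON_FORK
    return POLICY_NAMES.get(policy, "unknown (%d)" % policy)


def pids_by_name(name):
    """Return the PIDs whose /proc/<pid>/comm equals name."""
    if not os.path.isdir("/proc"):
        return []
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as handle:
                if handle.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            pass  # the process exited while we were scanning
    return pids


if __name__ == "__main__":
    for pid in pids_by_name("kwin_x11"):
        print(pid, policy_name(pid))
```

SCHED_FIFO or SCHED_RR would mean the compositor is an RT task; SCHED_OTHER means it is scheduled like any normal task.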

Earlier RDB versions had these problems for me too, though maybe they have changed a little. Earlier RDB couldn't utilize all cores during compilation with #threads=#cores. This seems to be better now. As for the lags in game, it was similar.

I'm also testing whether foreground processes are affected by heavy background processes, like the mentioned compilation with nice -19. This doesn't work well for me on anything except BMQ (but BMQ changes the priorities of processes on the fly, which is maybe a little bit cheating).

Also, I'm not recompiling to test autogroup on/off. Just to confirm: does echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled switch it off?

ltsdw commented 3 years ago

@JohnyPeaN Yeah, I think so. You can also use the kernel command-line parameter noautogroup.
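To confirm the result after either method, the same file can simply be read back. A minimal sketch; the function name is made up, and the "not available" case assumes a kernel built with CONFIG_SCHED_AUTOGROUP=n, where the file does not exist.

```python
from pathlib import Path

AUTOGROUP_SYSCTL = "/proc/sys/kernel/sched_autogroup_enabled"


def autogroup_state(path=AUTOGROUP_SYSCTL):
    """Return 'enabled', 'disabled', or 'not available'."""
    sysctl = Path(path)
    if not sysctl.exists():
        # Assumed cause: the kernel was built without CONFIG_SCHED_AUTOGROUP.
        return "not available"
    return "enabled" if sysctl.read_text().strip() == "1" else "disabled"


if __name__ == "__main__":
    print("autogroup:", autogroup_state())
```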

MoisesMH commented 3 years ago

Hey @hamadmarri, I've tried compiling the kernel as you suggested (with CONFIG_HZ_PERIODIC=y instead of CONFIG_NO_HZ_IDLE=y), but some modules gave me errors while building, so I was afraid it wasn't building properly. Besides, the compilation finished in less than 7 minutes, while all the kernels I've compiled usually take between 15 and 20 minutes. That is suspicious; maybe the error interrupted the whole process. I'm attaching a fragment of the output where the errors appear with CONFIG_HZ_PERIODIC=y:

CC kernel/sched/clock.o
CC fs/crypto/keysetup_v1.o
CC fs/verity/signature.o
CC arch/x86/events/amd/uncore.o
CC fs/notify/notification.o
CC mm/maccess.o
AR fs/verity/built-in.a
CC mm/page-writeback.o
CC fs/crypto/policy.o
CC fs/notify/group.o
CC kernel/sched/cputime.o
CC kernel/sched/idle.o
CC arch/x86/events/amd/ibs.o
CC fs/crypto/bio.o
CC fs/notify/mark.o
CC arch/x86/events/amd/iommu.o
CC fs/crypto/inline_crypt.o
CC kernel/sched/fair.o
CC kernel/sched/rt.o
CC fs/notify/fdinfo.o
CC mm/readahead.o
CC [M] arch/x86/events/amd/power.o
AR fs/crypto/built-in.a
CC mm/swap.o
AR fs/notify/built-in.a
CC fs/nfs_common/nfs_ssc.o
kernel/sched/fair.c: In function ‘newidle_balance’:
kernel/sched/fair.c:11324:17: error: implicit declaration of function ‘nohz_newidle_balance’; did you mean ‘nohz_run_idle_balance’? [-Werror=implicit-function-declaration]
11324 | nohz_newidle_balance(this_rq);
| ^~~~~~~~
| nohz_run_idle_balance
CC [M] fs/nfs_common/nfsacl.o
AR arch/x86/events/amd/built-in.a
CC arch/x86/events/intel/core.o
CC arch/x86/events/intel/bts.o
CC arch/x86/events/zhaoxin/core.o
CC [M] fs/nfs_common/grace.o
LD [M] fs/nfs_common/nfs_acl.o
CC mm/truncate.o
AR fs/nfs_common/built-in.a
CC fs/iomap/trace.o
CC mm/vmscan.o
AR arch/x86/events/zhaoxin/built-in.a
CC mm/shmem.o
CC fs/iomap/apply.o
CC arch/x86/events/intel/ds.o
CC fs/iomap/buffered-io.o
cc1: some warnings being treated as errors
make[2]: [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: [scripts/Makefile.build:516: kernel/sched] Error 2
make: [Makefile:1862: kernel] Error 2
make: Waiting for unfinished jobs....
CC fs/iomap/direct-io.o
CC arch/x86/events/intel/knc.o
CC mm/util.o
CC mm/mmzone.o
CC arch/x86/events/intel/lbr.o

On the other hand, when I compile with just tickless idle (CONFIG_NO_HZ_IDLE=y), the kernel compiles without any errors. I've only applied the cacule, uksm, futex2, security, and more-uarches patches. I also got errors because I had applied fsync as well, which is a previous version of the more advanced futex2 approach, but that is separate functionality. That explains why the PKGBUILD provided by TkGlitch gave me those errors too when CONFIG_HZ_PERIODIC=y is set. I don't know why. I think you should inspect those lines. I don't think the other patches are causing the problem, since I haven't integrated any other scheduler, and CacULE replaces CFS. That's all the information I can provide. Greetings!

NOTE: maybe your scheduler is designed to work only with full tickless and tickless idle kernels? Maybe I'm confused hehe
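With a parallel build, the real failure is easy to miss among the CC/AR lines, as in the fragment above. A minimal sketch that keeps only compiler diagnostics and the failure lines printed by make; the two regular expressions are assumptions fitted to the log format shown in this comment, not a general-purpose parser.

```python
import re

ERROR_PATTERNS = (
    # Compiler diagnostics such as "file.c:11324:17: error: ...".
    re.compile(r"\berror:"),
    # make reporting a failed target, e.g. "make[2]: [...] Error 1".
    re.compile(r"^make(\[\d+\])?: .*Error \d+"),
)


def build_errors(log_text):
    """Return only the lines of a build log that report an error."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in ERROR_PATTERNS)
    ]


if __name__ == "__main__":
    import sys

    # Usage: make ... 2>&1 | python3 this_script.py
    for line in build_errors(sys.stdin.read()):
        print(line)
```

Any output at all means the build did not complete, which would also explain the unusually short build time.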

hamadmarri commented 3 years ago

Hi @MoisesMH

Could you please try this fix? https://github.com/hamadmarri/cacule-cpu-scheduler/issues/47#issuecomment-901082918

I will push the fix to GitHub soon.

Thanks

EDIT:

https://github.com/hamadmarri/cacule-cpu-scheduler/commit/bb773768683ad2754329a9f3629d4948b3b47c03

MoisesMH commented 3 years ago

Hey @hamadmarri, I've compiled a kernel with your latest commit plus some additional patches, but the kernel was not working properly: when gaming, the framerate wasn't stable and the CPU usage was too high (I guess that happened because of esync; futex2 was not working, even though I patched it). So I went on to test the fix you suggested in your last message, applied to TkGlitch's linux-tkg kernel, which I guess has an earlier version of your scheduler.

I have to say that on my system, even with CONFIG_HZ_PERIODIC=y, there are still lag spikes, but they're less frequent than with CONFIG_NO_HZ_IDLE=y. I also set the variables you suggested in my /etc/sysctl.conf:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

and then executed "sudo sysctl --system" to apply the changes, but those hangs are still present. Disabling autogroup (kernel.sched_autogroup_enabled=0) helped a little to reduce the frequency of the lag spikes and their duration (each hang now lasts up to 2 seconds). Before, with CONFIG_NO_HZ_IDLE=y, they lasted 5 seconds on average. In the menu everything is smooth, and even during gameplay the game runs butter-smooth when the hangs are not present. Another detail: while a hang is happening, the MangoHud overlay shows CPU usage rising by about 10 percentage points on average (from 55% to 65%; it even reached 74%). It's weird that it only happens under heavy workload. For other tasks, like running Audacious or Lutris, it's noticeably faster than without RDB. The speed at which different applications open surprises me. For those jobs it's butter-smooth; the hangs only happen during intensive gameplay. That's all I've got. I really have no idea why it only happens under intensive workload. Maybe the code is not adapted to deal with it, only with ordinary tasks. I'll stay tuned for more news. Thanks for the effort. Keep it up!

EDIT: could you please apply the last two commits to cacule-5.14.patch? I want to apply it for testing with the futex2-dev kernel from Collabora. Thanks!

hamadmarri commented 3 years ago

Hi @MoisesMH Updated 5.14 https://github.com/hamadmarri/cacule-cpu-scheduler/tree/master/patches/CacULE/v5.14

Thank you

MoisesMH commented 3 years ago

Nice! Thank you. By the way, I've recently tried the liquorix kernel with the MuQSS scheduler (CONFIG_HZ_100=y is the default), android modules, ntfs3 and uksm. What surprises me is the CPU usage. At the game menu of Star Wars Battlefront II, your scheduler with linux-tkg (CONFIG_HZ_1000=y is the default) uses 24% to 32-33% CPU, but with this new kernel it reached a whopping 54% to 59%. I don't believe uksm is what increases the CPU usage, because its main function is memory deduplication; in my opinion that's not possible. Also, during gameplay your scheduler was at around 54% to 62% CPU usage, while the lqx kernel with MuQSS reached from 66% up to 79%. It's impressive how optimized the linux-tkg kernel is compared to liquorix. I haven't tried linux-tkg with uksm yet. I'm going to compile it now and see how it does with CacULE, with and without RDB. Keep it up!

hamadmarri commented 3 years ago

The issue could be related to kernel.sched_cacule_yield. Can you please try with

kernel.sched_cacule_yield = 0

Thank you

MoisesMH commented 3 years ago

Hey @hamadmarri, I've set kernel.sched_cacule_yield=0 in my sysctl.conf, but it didn't help. Instead, things became unstable and I saw more lag spikes during co-op gameplay, though not at the game menu. So it performs noticeably better with kernel.sched_cacule_yield=1. Oh, I compiled with CONFIG_NO_HZ_IDLE=y and an RDB interval of 15. I was also testing with UKSM. I noticed that lag spikes are less frequent with this configuration. I don't know which setting helped neutralize some of the hangs: the new RDB interval with CONFIG_NO_HZ_IDLE=y, or UKSM could be helping too. I wonder what the results would be with periodic ticks and kernel.sched_cacule_yield=0. Cheers!

hamadmarri commented 3 years ago

Hi @MoisesMH , @raykzhao , @JohnyPeaN , @ltsdw , @ptr1337 , @SoongVilda

I am planning to rework RDB and start over from the beginning. First I need to review how the nohz idle wakeup mechanism works. I am also thinking of adding an extra feature where some CPUs are assigned to be servants for interactive tasks (they give more priority to interactive tasks but can still run non-interactive tasks at the same time). This idea is based on this paper (https://www.researchgate.net/profile/Julien-Soula/publication/254213707_ARTiS_an_Asymmetric_Real-Time_Scheduler_for_Linux_on_Multi-Processor_Architectures/links/00b495350104a70d19000000/ARTiS-an-Asymmetric-Real-Time-Scheduler-for-Linux-on-Multi-Processor-Architectures.pdf)

The next RDB must take all of the nohz work into account, and may use a global queue of candidate tasks that holds one task from each CPU (the task that has the highest IS but is not running). Each CPU will have one slot in the global queue, and it must guarantee that the task advertised in its slot is ready to migrate at any time, unless the slot holds a null value.

The amount of locking could increase, but the queue is not very big: it only contains nproc items.
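To make the slot idea easier to discuss, here is a toy user-space model of such a queue. It is only a sketch of the description above, not RDB code: the class and method names are invented, a single Python lock stands in for whatever locking the kernel would use, and "highest IS" is read literally as the numerically largest score, which is an assumption about how the score is ordered.

```python
import threading


class CandidateQueue:
    """One slot per CPU; each slot advertises at most one migratable task."""

    def __init__(self, nr_cpus):
        self._slots = [None] * nr_cpus  # None plays the role of the null slot
        self._lock = threading.Lock()   # one global lock over nproc items

    def advertise(self, cpu, task, score):
        """Publish this CPU's best waiting (not running) task."""
        with self._lock:
            self._slots[cpu] = (score, task)

    def withdraw(self, cpu):
        """Clear the slot, e.g. when the advertised task starts running."""
        with self._lock:
            self._slots[cpu] = None

    def take_best(self, for_cpu):
        """Claim the highest-scored task advertised by another CPU, if any."""
        with self._lock:
            best_cpu = None
            for cpu, slot in enumerate(self._slots):
                if cpu == for_cpu or slot is None:
                    continue
                if best_cpu is None or slot[0] > self._slots[best_cpu][0]:
                    best_cpu = cpu
            if best_cpu is None:
                return None
            score, task = self._slots[best_cpu]
            self._slots[best_cpu] = None  # the slot is empty once claimed
            return best_cpu, task, score


if __name__ == "__main__":
    queue = CandidateQueue(nr_cpus=4)
    queue.advertise(0, "task-a", 10)
    queue.advertise(1, "task-b", 30)
    print(queue.take_best(for_cpu=3))
```

Because the scan is bounded by the number of CPUs, the time spent holding the lock stays small, which matches the point about the queue containing only nproc items.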

Thanks

MoisesMH commented 3 years ago

Hey @hamadmarri, that article seems interesting; I'm going to read more of it later. On the other hand, I've made a discovery too, haha. I had never thought of tweaking the values you described when you introduced them in a discussion. I'm referring to cache_factor and starve_factor (at https://github.com/hamadmarri/cacule-cpu-scheduler/discussions/43). First, I changed cache_factor to 0, as you suggested, and starve_factor to 15944. When playing SWBF 2, all of the hangs were apparently gone on one map. Then the match changed to another map, which I guess was more resource hungry because of its more complex graphics (terrain, leaves, ambient occlusion, etc.), and two or three hangs appeared. Then I changed cache_factor to 8192, since you suggested increasing it. The performance was the same until one big hang appeared (5 seconds, I guess). Then it went back to normal and the game was fluid. What's important here is that playing with those settings affected the way RDB was performing. I'm not sure whether I have to lower starve_factor to avoid those peaks or not. You mentioned that raising it makes the system run smoother, but I suspect that raising it too much will starve more groups of applications. I'm not sure about that, but I'll try a better value to see if it's true. I also have a question: after finding a starve_factor that fits me, why do you say we have to raise cache_factor as much as we can? On a less intensive map, with cache_factor=0 and starve_factor=15944, the game ran with no peaks at all and the framerate was stable. But on this new, more intensive map, I saw just one or two peaks with cache_factor at 0 or 8192. In fact, with 8192 I saw one or two more peaks than with 0. How would you explain this, so I can understand the tweaking I've done? Greetings and take care!

EDIT: my current RDB interval is 15 and I'm running with CONFIG_NO_HZ_IDLE=y and CONFIG_NO_HZ=y. Also, I've disabled the compositor with the Alt+Shift+F12 key combination.

EDIT 2: These are the combinations which gave me almost no spikes on Star Wars Battlefront II:

1) One or two lag spikes (the framerate was constant, even on a graphically intensive map such as Ajan Kloss):
sched_cache_factor = 7972
sched_starve_factor = 19930

2) No spikes at all (I don't remember if I tested the map Ajan Kloss with this configuration, but other maps were running flawlessly):
sched_cache_factor = 3986
sched_starve_factor = 17937

Increasing the cache factor only made things worse and doesn't allow a decent gameplay experience; in my opinion, too high a value can be the cause of those lag spikes. My numbers were inspired by my total installed memory, as shown by the "free" command in the console, which returned a total of 15944 MB (16GB). Numbers I got: 7972 1993 17937 19930 3986 5979 4983.
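For readers wondering how those numbers relate to the 15944 MB figure, this short sketch just recomputes each value as a fraction of the total. It only reproduces the arithmetic from the post; it says nothing about whether memory size is a meaningful basis for these sysctls.

```python
TOTAL_MB = 15944  # total memory reported by `free` in the post above
TRIED = [7972, 1993, 17937, 19930, 3986, 5979, 4983]


def as_ratio(value, total=TOTAL_MB):
    """Express a tried value as a fraction of the total memory figure."""
    return value / total


for value in TRIED:
    print(f"{value:>6} = {as_ratio(value):.4f} x {TOTAL_MB}")
```

The values work out to 1/2, 1/8, 9/8, 5/4, 1/4 and 3/8 of the total, with 4983 being roughly 5/16.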

hamadmarri commented 3 years ago

Hi @MoisesMH

The cache factor does not seem to work well with the RDB design. I need to troubleshoot the cache and starve factors too.

Thank you

MoisesMH commented 3 years ago

Yeah, it seems to be causing the issue. I've discovered another combination, which I think is close to the default:

kernel.sched_cache_factor = 10629
kernel.sched_starve_factor = 21258

I've experienced no spikes with this configuration, apart from one peak at the beginning of gameplay, and I haven't noticed any freezes or big hangs. I think I'll stay with this configuration. The two values sum to less than the sched_interactivity_factor, and one is double the other (1/3 and 2/3 of 31888). Hope you're doing great with your investigation and development!
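The split described above can be reproduced with integer division. A minimal sketch, assuming 31888 is the interactivity factor value the poster is working from; the variable names are for illustration only.

```python
# Value quoted in the post; assumed to be the poster's interactivity factor.
INTERACTIVITY_FACTOR = 31888

cache_factor = INTERACTIVITY_FACTOR // 3       # one third, rounded down
starve_factor = 2 * INTERACTIVITY_FACTOR // 3  # two thirds, rounded down

print(f"kernel.sched_cache_factor = {cache_factor}")
print(f"kernel.sched_starve_factor = {starve_factor}")
print("sum below interactivity factor:",
      cache_factor + starve_factor < INTERACTIVITY_FACTOR)
```

Rounding both values down is what keeps their sum just under the factor.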