hamadmarri / cacule-cpu-scheduler

The CacULE CPU scheduler is based on an interactivity score mechanism. The interactivity score is inspired by the ULE scheduler (the FreeBSD scheduler).

Major lockups with wine games #4

Closed: barolo closed this issue 3 years ago

barolo commented 4 years ago

When playing something with the Cemu emulator, the system starts locking up for a couple of seconds at a time and then locks up entirely. I tested with a few others too; it seems to occur when the CPU is particularly strained, and the GPU just locks up.

[ AMD Raven, 4 core APU ]

hamadmarri commented 4 years ago

Hi Barolo,

I am worried that this is caused by disabling cgroups. I am not sure whether the Cemu emulator depends on cgroups. Which patch are you using, 5.8? Could kernel modules be missing?

Edit: I remember having a similar issue while compiling the kernel, where the system froze for a second. That happened on both CFS and Cachy, but never on MuQSS. I believe there is a problem related to heavy IO waits. This problem might be inherited from the original Linux kernel scheduler.

barolo commented 4 years ago

5.8. The same version without cachy, using CFS, runs fine. I've attached my .config. It's not only Cemu; I'm checking some native CPU-straining games now.

config.tar.gz

Do you mean that CONFIG_CGROUPS has to be disabled entirely?

hamadmarri commented 4 years ago

> Do you mean that CONFIG_CGROUPS has to be disabled entirely?

No, only Fair_group should be disabled. But maybe some processes need the Fair_group functionality to perform well. I am working on making the scheduler work with Fair_group. I have looked at your config file and it seems OK, so I am not able to figure out why this problem happens. Can you give the 5.7.10 version a try?

Have you tried make localmodconfig before building the kernel? This is to make sure the config of the previous kernel version is carried over.

barolo commented 4 years ago

It's the same config, I've copied it by hand. Running 5.7 would nullify any potential benefits of cachy because amdgpu upgrades in 5.8 make GPU intensive tasks faster [ at least on my hardware ]

hamadmarri commented 4 years ago

I am going to test and debug the scheduler with the Cemu emulator next week. Please attach any log files related to this issue, if you have any. Thanks

hamadmarri commented 4 years ago

Does the whole system freeze for a couple of seconds? It seems to be the same issue I see when compiling the kernel with -j set to $(nproc) + 1. The system freezes for a couple of seconds and then goes back to normal.

barolo commented 4 years ago

It starts with a couple secs then gets progressively worse till it locks up completely.

barolo commented 4 years ago

I see that xanmod folks decided to include it in their kernel, gonna test that one and see how it behaves

hamadmarri commented 4 years ago

Hi Greg,

This commit fixed the thread throttle issue: https://github.com/hamadmarri/linux/commit/6406ba353670bbcafed5b8f39b0e57410b5375d3 (branch: https://github.com/hamadmarri/linux/tree/hrrn_lifetime)

Could you please check whether the problem still exists with this commit?

Thanks

arabcian commented 4 years ago

I get frequent mini freezes in Thief (2014) under wine with the 5.8.3 patch and no other fixes.

hamadmarri commented 4 years ago

Yes, that's because new tasks have a short lifetime, so their hrrn value is huge compared with long-running tasks such as X. This has been fixed in the hrrn_lifetime branch.
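To illustrate why a freshly started task can outrank a long-running one such as X, here is a minimal sketch of the textbook Highest Response Ratio Next formula, ratio = (waiting + service) / service. It is not Cachy's actual code: the function name `response_ratio` and all the numbers are hypothetical and chosen only for illustration.

```python
def response_ratio(waiting_ms: float, service_ms: float) -> float:
    """Textbook HRRN ratio: (waiting + service) / service."""
    if service_ms <= 0:
        raise ValueError("service_ms must be positive")
    return (waiting_ms + service_ms) / service_ms


# Hypothetical numbers, not measurements from this thread.
new_task = response_ratio(waiting_ms=5.0, service_ms=0.1)        # has barely run yet
old_task = response_ratio(waiting_ms=2000.0, service_ms=8000.0)  # long-lived, e.g. X

print(f"new task ratio: {new_task:.2f}")
print(f"old task ratio: {old_task:.2f}")
```

Because the service time sits in the denominator, a task that has barely run gets a very large ratio, so a burst of short-lived tasks can keep being picked ahead of an older task.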

hamadmarri commented 4 years ago

This is the patch for v5.8 https://github.com/hamadmarri/cachy-sched/blob/master/patches/cachy/v5.8/cachy-5.8-r2.patch

Please let me know if the issue still happens.

arabcian commented 4 years ago

Thx. I'll do a test run.

arabcian commented 4 years ago

This is a comparison of 5.8.6 PDS vs 5.8.8 CACHY:

https://browser.geekbench.com/v5/cpu/compare/3666484?baseline=3734231

While cachy's single-core performance is better than pds, pds is ahead in the multi-core bench thanks to numa=y.

Sadly, mini freezes still happen with the new patch.

hamadmarri commented 4 years ago

Hi @arabcian ,

Could you please test with different sched_hrrn_latency_us values on cachy (the default is 0):

sudo sysctl kernel.sched_hrrn_latency_us=6000
sudo sysctl kernel.sched_hrrn_latency_us=12000

The first sets the latency to 6 ms, the second to 12 ms. You can also tune this value as you like (negative and positive values). Please let me know which value causes fewer or no freezes.
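For testing several values in turn, the commands can be generated rather than typed by hand. The following is only a sketch: the helper name `sysctl_commands` is hypothetical, the negative value is an arbitrary example, and the script prints the commands instead of running them.

```python
def sysctl_commands(latencies_us):
    """Return one `sudo sysctl` command line per candidate value (microseconds)."""
    return [
        f"sudo sysctl kernel.sched_hrrn_latency_us={int(value)}"
        for value in latencies_us
    ]


# 0 is the stated default; 6000 and 12000 are the values suggested above.
# -6000 is an arbitrary example of a negative setting.
candidates = [0, 6000, 12000, -6000]

for command in sysctl_commands(candidates):
    print(command)
```

Running one command at a time and noting how often the freezes occur under each setting keeps the comparison simple.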

If NUMA=n on PDS, would that make any difference in the benchmark?

arabcian commented 4 years ago

OK, I was wrong about the effect of numa on the benchmark, because when I bench 5.8.8-pds with numa=n I get this:

https://browser.geekbench.com/v5/cpu/3738412

And this is with 5.8.6-cachy

https://browser.geekbench.com/v5/cpu/3738483

I can't always reproduce the mini freezes; sometimes they happen more, sometimes less. Setting sudo sysctl kernel.sched_hrrn_latency_us=40000 mostly eliminates the problem, and setting the CPU governor to performance almost completely eliminates it.

Update: I did a test run on many games and can confirm that micro freezes happen only with Thief (2014). So somehow the Thief engine, which is Unreal, doesn't like cachy. I think you should concentrate on why the Geekbench score is lower than with pds. You can set one of the Geekbench results as the baseline and load the others one by one to compare performance differences, like this:

https://browser.geekbench.com/v5/cpu/compare/3738483?baseline=3738412

duud commented 4 years ago

I'm on linux 5.8.9 with cachy-5.8-r2.patch.

I'm getting regular freezes (sometimes after about 20-30 sec, sometimes after 1 or 2 minutes) while playing Quake Champions using wine. Sometimes it freezes for quite a long period, about 5 sec. Changing sched_hrrn_latency_us doesn't improve the situation; setting sched_hrrn_max_lifetime_ms to 1000s eliminated the freezes on a 5-minute test run.

Latency is very low in comparison with CFS/MuQSS, which is very noticeable in a fast game like Quake. Quake Champions is heavily multithreaded. As far as I can tell, it uses a thread pool of about 60 threads. Maybe there is an issue with how cachy behaves with a large thread pool: every frame the engine picks random threads out of the pool, which might have been suspended for a long period.

hamadmarri commented 4 years ago

Hi @arabcian ,

It is worth noting that the Geekbench tests are all about performance, not interactivity or responsiveness. Cachy performs poorly on tasks such as zip/unzip, any kind of compression, and almost all the tests listed in this Geekbench comparison: https://browser.geekbench.com/v5/cpu/compare/3738483?baseline=3738412

I believe that CFS is superior on those tests compared to other schedulers.

barolo commented 4 years ago

I've found as much in my benches: for raw throughput and fps, tweaked CFS is best. Alternative schedulers like cachy are viable if latency is a concern. The Xanmod guys scrapped cachy for the time being due to its instability.

arabcian commented 4 years ago

> I've found in my benches as much, raw throughput, fps is best with tweaked CFS

I get the best performance with pds/prjc, a 100Hz tick and no preemption (server), with Intel TSX on, no retpoline and the ORC unwinder. Unneeded stuff is removed, and wine-staging has the esync and fsync patches applied, compiled from source with march=native, graphite optimizations and the gold linker; as I use Gentoo, those are easy to set up. All of them together give me about 15% more performance.

barolo commented 3 years ago

@arabcian I'm on GentooLTO (you shouldn't be using gold; it's unmaintained and buggy). PRJC results in odd behaviour in some emus, where it caps cores to their mids and spreads work across them, whereas in many such apps it's the single-core perf that matters. Also, I'm on AMD without hyperthreading/ccx, just real cores; maybe that's the reason.

hamadmarri commented 3 years ago

Could you please try this branch, I added autogroup, cgroups, fair_group support https://github.com/hamadmarri/linux/tree/groups_numa

Let me know if it is better with this branch

arabcian commented 3 years ago

I tried the version bundled with Xanmod-sources 5.8.11, but sadly microfreezes are still happening. I couldn't make 5.9 work in my last try.

hamadmarri commented 3 years ago

I will commit a good version of cachy in the next two days: no freezes, good performance

hamadmarri commented 3 years ago

Please try this patch https://github.com/hamadmarri/cachy-sched/blob/master/patches/cachy/v5.8/cachy-5.8-r5.patch

Note: enable NUMA and cgroups; cachy now supports both.

Please let me know if the issue is solved with this patch

arabcian commented 3 years ago

Trying.

arabcian commented 3 years ago

OK, you're on the right track. The freezes are gone and the Geekbench score is better. I'll do a test run for you. I built the kernel with no forced preemption and a 100Hz tick. I'm really kind of insensitive when it comes to the snappiness that more tweaked settings bring; I just feel no difference between no forced preemption and a low-latency kernel, or between 100Hz and 1000Hz. It has been the same with every kernel I tried before. I use the XFCE4 desktop right now, which is already snappy enough. I'll keep you updated. Just tell me which workloads you would like me to test, and I'll run tests comparing with PDS/PRJC and maybe MuQSS, which I don't like.

hamadmarri commented 3 years ago

Maybe try games and check the fps. You might also try compiling the kernel and, while it's compiling, check how responsive the system is, for example by browsing the web or doing any other task. Cachy should respond quickly while the system is under load. You might also compare Geekbench scores with other scheds.

Thank you so much

hamadmarri commented 3 years ago

Something like the YouTube test below:

I made a comparison between cfs and cachy on xanmod, as a blind test:
test1: https://youtu.be/DilwWlNbExg?t=14
test2: https://youtu.be/1S3OxLrcbGY?t=14

To reveal which is which, go back to time 0s in the video and look at the uname -r output.

Note: In one of the tests, the recorder seems to freeze and lag. I repeated this test twice; during testing the system itself was not pausing, but the recorder may have been freezing or lagging while recording.

Please let me know which one you feel is more responsive. Notice also that the heavy load makes the screen recorder lag. That's why I like to record with my phone, which shows the real responsiveness.

hamadmarri commented 3 years ago

> When playing something with the Cemu emulator, the system starts locking up for a couple of seconds at a time and then locks up entirely. I tested with a few others too; it seems to occur when the CPU is particularly strained, and the GPU just locks up.
>
> [ AMD Raven, 4 core APU ]

Hi @barolo ,

Could you please confirm if the issue is resolved with this patch https://github.com/hamadmarri/cachy-sched/blob/master/patches/cachy/v5.8/cachy-5.8-r5.patch ?

Thanks

arabcian commented 3 years ago

I have tried Thief (2014) and WoW WotLK so far. Gaming performance is not much different; it's as good as PRJC. Something I noticed: when I play WoW I watch the CPU load with MangoHud, and cachy causes one of the cores to randomly spike to 100% load, which I think is impossible. Meanwhile, the system keeps the CPU frequency at the maximum turbo frequency. I'm not sure if this is a problem or not. With PRJC, the CPU load would be shared across the first 4 cores and the frequency would fluctuate between 900 and 2000MHz.

hamadmarri commented 3 years ago

In cachy-5.8-r5, the CPU balancing is exactly the same as the CFS code; I just uncommented the CFS code. The only thing I can think of is that I removed some power-saving/efficiency calculation code. Another possible reason is the way the HRRN policy works: the next task to run is the one with the higher hrrn value, and no timeslice is considered. That said, I couldn't notice any CPU spikes while testing. Could you please double check that NUMA, CGROUPS, and FAIR_GROUP are enabled? Is the .config file the same as the one used to compile PRJC?

Thank you

arabcian commented 3 years ago

Hello again.

TESTS

5.5.8-UndeadPDS

Wow FPS: 135

7z:
Avr: 723 3771 27258 | 794 2911 23104
Tot: 758 3341 25181

SysBench:
CPU speed: events per second: 9204.71
General statistics: total time: 10.0007s, total number of events: 92065
Latency (ms): min: 0.83, avg: 0.87, max: 5.45, 95th percentile: 0.87, sum: 79991.11
Threads fairness: events (avg/stddev): 11508.1250/20.37, execution time (avg/stddev): 9.9989/0.00

5.8.11-Cachy

Wow FPS: 133

7z:
Avr: 683 3794 25929 | 792 2911 23061
Tot: 738 3353 24495

SysBench:
CPU speed: events per second: 9193.63
General statistics: total time: 10.0008s, total number of events: 91954
Latency (ms): min: 0.82, avg: 0.87, max: 30.04, 95th percentile: 0.87, sum: 79989.98
Threads fairness: events (avg/stddev): 11494.2500/30.65, execution time (avg/stddev): 9.9987/0.00

5.9.0-Tuned CFS

Wow FPS: 133

7z:
Avr: 674 3764 25381 | 795 2909 23124
Tot: 734 3337 24253

SysBench:
CPU speed: events per second: 9173.94
General statistics: total time: 10.0006s, total number of events: 91756
Latency (ms): min: 0.74, avg: 0.87, max: 30.87, 95th percentile: 0.87, sum: 79973.55
Threads fairness: events (avg/stddev): 11469.5000/20.01, execution time (avg/stddev): 9.9967/0.01

arabcian commented 3 years ago

> Could you please double check that NUMA, CGROUPS, FAIR_GROUP are enabled

Ah, it's OK, there are no CPU spikes anymore. I guess that was related to running wine with WINEESYNC=1; without it, it doesn't happen anymore and everything looks normal.

hamadmarri commented 3 years ago

Thanks for the stats. Could you please indicate which direction is better (higher or lower) for each result? For FPS, for example, higher is better. I am not sure about 7z, for example: is it seconds? Is lower better?

Thank you

arabcian commented 3 years ago

In the 7z test, higher is better. The first stat is usage %, and the second and third are MIPS.
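As a sketch of how to read those lines, the snippet below splits an "Avr:" line into its compressing and decompressing halves. The field names follow the Usage, R/U and Rating column headers that the 7-Zip benchmark prints, as far as I recall them; the class and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class SevenZipRow(NamedTuple):
    usage_pct: int    # CPU usage in percent
    ru_mips: int      # rating per 100% usage, in MIPS
    rating_mips: int  # overall rating, in MIPS


def parse_avr(line: str) -> Tuple[SevenZipRow, SevenZipRow]:
    """Split an 'Avr:' line into (compressing, decompressing) rows."""
    body = line.split(":", 1)[1]
    left, right = body.split("|")
    compressing = SevenZipRow(*map(int, left.split()))
    decompressing = SevenZipRow(*map(int, right.split()))
    return compressing, decompressing


# The 5.8.11-Cachy line posted earlier in this thread.
comp, decomp = parse_avr("Avr: 683 3794 25929 | 792 2911 23061")
print("compressing:", comp)
print("decompressing:", decomp)
```

The "Tot:" line has the same three fields without the `|` separator, averaged over both halves.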

hamadmarri commented 3 years ago

It seems that we need to work on performance :/

Thank you so much @arabcian

arabcian commented 3 years ago

I think it's within the error margin. There is not much difference in performance compared to the other schedulers. Maybe if usage were as high as it was with the PDS scheduler, it would give almost the same performance. But work on improving performance is always appreciated.
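To put numbers on "within the error margin", here is a small sketch that expresses each result as a percentage difference from the PDS run. The figures are copied from the results posted earlier in this thread; the dictionary keys and the helper name `percent_vs` are made up for this example, and a single run per kernel says nothing about run-to-run variance.

```python
# Figures copied from the benchmark comment earlier in this thread.
results = {
    "5.5.8-UndeadPDS": {"wow_fps": 135, "7z_tot_mips": 25181, "sysbench_eps": 9204.71},
    "5.8.11-Cachy":    {"wow_fps": 133, "7z_tot_mips": 24495, "sysbench_eps": 9193.63},
    "5.9.0-Tuned CFS": {"wow_fps": 133, "7z_tot_mips": 24253, "sysbench_eps": 9173.94},
}


def percent_vs(baseline: float, value: float) -> float:
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


baseline = results["5.5.8-UndeadPDS"]
for kernel, metrics in results.items():
    deltas = ", ".join(
        f"{name} {percent_vs(baseline[name], value):+.2f}%"
        for name, value in metrics.items()
    )
    print(f"{kernel}: {deltas}")
```

All three metrics are "higher is better", so a negative percentage means the kernel scored below the PDS run.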

arabcian commented 3 years ago

Is there any update in the code? I can test it.

hamadmarri commented 3 years ago

I am making three different versions of cachy for testing:
1. Similar to old cachy but with max_hrrn_lifetime. This one is harshly responsive and might cause some freezes, but in normal use it is very responsive.
2. Somewhat smoother than old cachy. It shouldn't cause any freezes, but it is less responsive than 1.
3. Similar to 2, but it postpones the nohz balance for 1s each time.

None of the three has dynamic priority (DP); DP has reduced the responsiveness/performance of cachy so far.

hamadmarri commented 3 years ago

cachy-test1.zip

Please test this patch on 5.8

If you like you could join us here https://t.me/cachy_sched in telegram

Thank you