Closed Pilleo closed 3 years ago
Hi @Pilleo
Can I do something to figure out whether it was a cacule problem or not?
Please try sudo journalctl -b -1. The -1 selects the previous boot. Please figure out which boot the problem happened in. You can try -2, -3, -4 ... until you find the logs from when the problem happened (you can check the time at the top).
Once you have found the boot, please redirect the journal output to a file and upload it here:
sudo journalctl -b -1 > log.txt
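The boot-by-boot search described above can be sketched roughly as follows. This is only an illustration: journalctl --list-boots and -b are standard journalctl options, while the save_boot_log helper, the output file name, and the JOURNALCTL override are names assumed here, not part of any tool.

```shell
# List the boots the journal knows about, with offsets and time ranges
# (requires a persistent journal). Run this by hand when investigating:
#   journalctl --list-boots

# save_boot_log OFFSET: write the journal of one boot to a file.
# Hypothetical helper. JOURNALCTL is an assumed override so the real command
# can be swapped out (for example with "echo") for a dry run.
save_boot_log() {
    offset="$1"                        # e.g. -1 for the previous boot
    out="journal-boot${offset}.txt"    # illustrative file name
    ${JOURNALCTL:-sudo journalctl} -b "$offset" --no-pager > "$out" && echo "$out"
}
```

For example, save_boot_log -1 would write the previous boot's log to journal-boot-1.txt; repeating with -2, -3, ... covers older boots.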
Could you please confirm that this issue doesn't happen with mainline kernel or with xanmod without cacule?
If this problem is only happening with cacule, please try this fix patch on top of 5.10.1 xanmod/cacule source. patch2.zip
Please let me know if it fixes the issue.
Thanks
Another thing: please double check your swap size. Your RAM might have filled up without enough swap space. Also check vm.swappiness <--- if you haven't changed it then don't worry about it (the default value is ok).
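One possible way to check both values from a shell is sketched below. The function names are made up for this example; the only things assumed about the system are the standard /proc/meminfo and /proc/sys/vm/swappiness files.

```shell
# swap_total_kib [FILE]: print the SwapTotal value (in KiB) from a
# meminfo-style file. Defaults to /proc/meminfo. Hypothetical helper name.
swap_total_kib() {
    awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# swappiness: print the current vm.swappiness value.
swappiness() {
    cat /proc/sys/vm/swappiness
}
```

Running swap_total_kib and swappiness with no arguments reports the live values; free -h and swapon --show give a similar overview in human-readable form.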
Sorry to go off topic. I want to ask, when will the latest version of CacULE or Cachy for kernel v5.10 be released? EDIT: I mean the patch file
My Distro: Gentoo/Linux (compile from source code)
Hi @owl4ce
Fortunately, the 5.9 patch also works on 5.10.
I am trying to fix the freezing problem and release a version with fixes. Could you please try the patch in https://github.com/hamadmarri/cacule-cpu-scheduler/issues/20#issuecomment-751235392 and let me know if it is smoother and free of mini-freezes under heavy load?
Thank you
I cannot confirm or refute that it was only with cacule. I have 16 GB RAM and 1 GB swap with swappiness 1. If I am developing in Java, then I am always short on memory. Earlier I had frequent freezes, so I installed EarlyOOM. It has worked just fine so far. Maybe this time it could not work for some reason, maybe it is something else. But I did not have that particular problem without cacule with EarlyOOM present.
log.txt
I was trying to use the Magic SysRq key, but I think the kernel could not respond to it. Hope it will help.
Hi @Pilleo
Isn't there a kernel bug at the end of the log? It seems to have something to do with SLUB/usercopy.
@hamadmarri , @raykzhao Does it mean it is not a cacule problem? Should we report somewhere else?
Hi @Pilleo
I don't think it's a cacule issue. I think it is related to the swap since SLUB is a memory allocator. It could be SLUB is not able to allocate more memory. I am sure that cacule has nothing to do with slub or memory issues.
You may want to report to earlyoom, or your distro forum?
Thank you
Is it possible that the problem is in kernel 5.10? In my understanding it should at least have reacted to the magic key, but it was a complete crash.
Yes, it could be.
I can confirm this. It happens during highly CPU-intensive tasks (xanmod-cacule 5.10 kernel, i3 2 core, 2 thread).
For example, shader precompile of Steam games hangs my system (I saw 4 fossilize_replay processes, but all of them were using more than 25% of the CPU each, thus pushing the system to hang).
This doesn't occur for memory-intensive tasks for me. Also, this wasn't an issue when it was still Cachy :)
I also experienced same thing.
Hi @hamadmarri @Salekin-1169 @owl4ce
I just checked the latest xanmod-cacule setting. It doesn't include the starvation fix, and SCHED_AUTOGROUP (and therefore FAIR_GROUP_SCHED) are enabled. Also it includes the following scheduler tweaks and uses the non-standard 500Hz timer (no such option in mainline):
sysctl_sched_nr_migrate = 256
sysctl_sched_rt_runtime = 980000
I just opened an issue at xanmod/linux#112 and mentioned the starvation fix and disabling FAIR_GROUP_SCHED.
Hi @Salekin-1169
This was fixed (hopefully) with the last commit. If you compile from source can you try to patch it with this patch https://github.com/hamadmarri/cacule-cpu-scheduler/issues/15#issuecomment-751220411
Thank you
Hi @owl4ce
Could you please double check that this patch https://github.com/hamadmarri/cacule-cpu-scheduler/issues/15#issuecomment-751220411 is applied?
You can confirm by
cat kernel/sched/fair.c | grep -B2 -A2 "p->se.exec_start = 0;"
Output
#if !defined(CONFIG_CACULE_SCHED)
/* We have migrated, no longer consider this task hot */
p->se.exec_start = 0;
#endif
Thank you
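The same check can be wrapped in a small yes/no helper, sketched here under the assumption that the guard sits within two lines above the assignment, as in the output shown. The function name is invented for this example.

```shell
# has_cacule_migration_guard FILE: succeed if the "exec_start = 0" reset in
# FILE has the CONFIG_CACULE_SCHED guard within the two lines before it.
# Hypothetical helper; it only looks at text, it does not compile anything.
has_cacule_migration_guard() {
    grep -B2 -F 'p->se.exec_start = 0;' "$1" \
        | grep -q -F '!defined(CONFIG_CACULE_SCHED)'
}
```

From the top of the kernel tree, has_cacule_migration_guard kernel/sched/fair.c && echo "patch applied" would then report whether the guard is present.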
Yes, I patched it. I am not sure what the problem is. When CPU usage is high and I play a song, the audio sometimes drops out for a few seconds and then comes back again. The system feels very heavy when all cores are at 100% usage.
I believe there is an issue in v5.10 please see https://github.com/xanmod/linux/issues/111
It could be a bug in mainline kernel v5.10, since the person who posted the issue in xanmod is not using cacule. I don't think the issue is in xanmod, since there is no big change in the xanmod patches from v5.9 to v5.10.
Can anyone confirm whether the freezes also occur in v5.9 using the latest cacule patch?
Thank you
Can you please try the latest Cacule patch on mainline kernel v5.9 (without xanmod)? Sorry for asking for so much compiling; the problem might not be related to cacule.
Thank you
This solution was suggested by Alexandre Frade. Thanks to him:
Hamad, when executing the nvidia-dkms and mkinitramfs triggers, the freezes reported by users started, even with the fix "remove start_exec = 0". The system normalized without the autogroup:
echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
All users with this problem, please try this solution.
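A minimal sketch of checking the autogroup state before and after the change is shown below. The /proc path is the one from the command above; the function names and the AUTOGROUP_FILE override are assumptions made for this example.

```shell
# Path taken from the command quoted above; the override variable is only
# an assumed convenience for trying the helper against another file.
AUTOGROUP_FILE="${AUTOGROUP_FILE:-/proc/sys/kernel/sched_autogroup_enabled}"

# autogroup_status: print "enabled", "disabled", "unknown" or "unavailable".
autogroup_status() {
    if [ ! -r "$AUTOGROUP_FILE" ]; then
        echo "unavailable"    # e.g. kernel built without SCHED_AUTOGROUP
        return 1
    fi
    case "$(cat "$AUTOGROUP_FILE")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

# disable_autogroup: same effect as the echo/tee command above (until reboot).
disable_autogroup() {
    echo 0 | sudo tee "$AUTOGROUP_FILE" > /dev/null
}
```

The setting written this way does not survive a reboot. Assuming the usual sysctl layout, the corresponding key would be kernel.sched_autogroup_enabled, which could be set to 0 from a file under /etc/sysctl.d to make it persistent.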
Hi everyone. I can confirm that, as said above, this can end the issue.
I did it by disabling these options in the kernel configuration (menuconfig):
SCHED_AUTOGROUP
FAIR_GROUP_SCHED
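To see how these two options are set in a given .config, something like the following sketch could be used. The config_state name is hypothetical; it only parses the plain-text config format (CONFIG_FOO=y, or a "# CONFIG_FOO is not set" comment).

```shell
# config_state OPTION [FILE]: print how OPTION is set in a kernel .config.
# Prints the value ("y", "m", a number, ...) when set, "off" otherwise.
# Hypothetical helper; FILE defaults to ./.config.
config_state() {
    opt="$1"
    file="${2:-.config}"
    value=$(sed -n "s/^CONFIG_${opt}=//p" "$file")
    echo "${value:-off}"
}
```

For example, config_state SCHED_AUTOGROUP and config_state FAIR_GROUP_SCHED should both print off after the change. As far as I know, the kernel tree's own scripts/config tool can also flip them non-interactively (scripts/config --disable SCHED_AUTOGROUP --disable FAIR_GROUP_SCHED, followed by make olddefconfig), but menuconfig as described above is the route confirmed in this thread.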
Yes, it worked for me. Everything is now smooth, even when compiling programs until all cores are 100% usage. xanmod 5.10.3 cacule
@hamadmarri @raykzhao @owl4ce sorry for the late reply. I updated the latest xanmod-cacule package and tested. Running steam shader precompile with youtube running in background. my system didn't come to a complete hang like before, but it was still unusable (audio skips in youtube, screen lags)
After that, I changed the value of kernel.sched_interactivity_factor to 50 and the system immediately became responsive.
I can even record what's happening currently on my pc now, which is impossible with the default value of kernel.sched_interactivity_factor
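For anyone wanting to try the same tunable, here is a rough sketch. kernel.sched_interactivity_factor only exists on CacULE-patched kernels; the helper names are invented for this example, and the only general fact relied on is that sysctl keys map to paths under /proc/sys.

```shell
# sysctl_path KEY: map a sysctl key to its /proc/sys path (dots -> slashes).
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr . /)"
}

# show_interactivity_factor: print the current value, or a note when the
# running kernel does not expose the CacULE tunable.
show_interactivity_factor() {
    path=$(sysctl_path kernel.sched_interactivity_factor)
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "kernel.sched_interactivity_factor not available on this kernel"
    fi
}
```

Setting the value would then be sudo sysctl -w kernel.sched_interactivity_factor=50, which lasts until the next reboot. The value 50 is simply the one reported above for a 2-core i3, not a general recommendation.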
What was the default value? 32768? Or 10?
@hamadmarri 32768
@hamadmarri I have a comparatively weak CPU (i3, 2 physical + 2 logical cores), so I think the default 32768 isn't suitable for my PC, while other people didn't face issues with this.
Hi @Salekin-1169
I believe the problem is in this line. It seems that a task that has a high run time will get a lower value! I am not sure if the original ULE used it this way. I will try one more fix. I will also add reset_life_time, the same as Cachy has.
When a task flips to being non-interactive, it should get a score fairly close to its previous score, but from the math I can see that the score jumps to the lowest value and only gradually starts to gain some score again.
Can you please try this patch. I fixed the interactivity score equation. It is now very similar to Cachy/HRRN
Hi @hamadmarri
I think the cacule_max_lifetime in the new patch should preferably be tunable via sysctl. Also it doesn't seem to include the starvation fix.
@hamadmarri Thank you for your quick fix :bow:
I apologize, because I'm still relatively new to linux and don't know how to compile custom kernel yet. I'll study about it this weekend and check if the latest patch fixes the issue, then let you know :vulcan_salute:
Hi @Salekin-1169
I'm not sure which Linux distro you are using. Generally speaking, you may try the following, in order:
Apply the patch from the top of the kernel source tree: patch -p1 -i interactivity_score_fix.patch
Check whether you have /proc/config.gz. If not, try to run modprobe configs first. Then copy the running kernel's configuration: zcat /proc/config.gz > .config
Run make menuconfig and make sure you disable the following:
General Setup-->Automatic process group scheduling
General Setup-->Control Group support-->CPU controller-->Group scheduling for SCHED_OTHER
You may also want to distinguish the new kernel from the existing one by appending a suffix at General Setup-->Local version - append to kernel release.
Build with make -jx, where x is the number of CPU cores. For example, if you have a 4-core CPU, you may run make -j4.
Install with make install and make modules_install.
For the initramfs, bootloader, and out-of-tree modules, you should look up the instructions for your specific Linux distro. Some distros may also have a guide on how to build custom kernels.
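Put together, the steps above might look like the sketch below. It is not a tested build recipe: the run helper, the DRY_RUN switch, and the build_patched_kernel name are assumptions for illustration, and the distro-specific parts (initramfs, bootloader, out-of-tree modules) are deliberately left out.

```shell
# run CMD...: execute a command, or only print it when DRY_RUN=1.
# Hypothetical helper so the sequence can be reviewed before running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_patched_kernel PATCH [JOBS]: run from the top of the kernel tree.
build_patched_kernel() {
    patch_file="$1"              # e.g. interactivity_score_fix.patch
    njobs="${2:-$(nproc)}"       # default: one job per CPU

    run patch -p1 -i "$patch_file" || return 1

    # Seed .config from the running kernel; load the configs module if
    # /proc/config.gz is not there yet.
    [ -r /proc/config.gz ] || run sudo modprobe configs || return 1
    run sh -c 'zcat /proc/config.gz > .config' || return 1

    # Interactive step: disable autogroup and group scheduling here.
    run make menuconfig || return 1

    run make "-j${njobs}" || return 1

    # Modules are installed first here, which is the order commonly
    # documented; the steps above list the two commands the other way round.
    run sudo make modules_install || return 1
    run sudo make install
}
```

Calling DRY_RUN=1 build_patched_kernel interactivity_score_fix.patch 4 only prints the commands, which is a cheap way to review the sequence before running it for real.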
Hi @raykzhao
The starvation fix is actually not a good solution, since a task can keep migrating even if it is cache hot. I think the fix in the interactivity score is a better approach. Also, I was mistaken about vruntime getting reset: only exec_start gets reset on migration, and it is updated right away in the set_next_task function.
Thank you
Hi @hamadmarri
Although the new fix seems better than the vanilla CacULE on my machine, unfortunately it doesn't feel better than the original Cachy or the CacULE with either the smoother or the starvation fix. Somehow it's slightly more glitchy on my machine under heavy load.
I still think both Cachy and CacULE should be kept in a single patch (#19), since it's difficult to find a scheduling policy that is suitable for everyone. On my machine the original Cachy is definitely the winner, but it may not be the case for others.
Same here, but on my machine Cachy is more responsive only when not under heavy load. CacULE is much more responsive when my machine is under heavy load with all cores at 100%. In conclusion, on my machine CacULE lightens the load when all cores are at 100%; I mean that the system can still move freely.
Hi @raykzhao
What I am worried about is that the problem is in neither Cachy nor CacULE. Maybe it's in v5.9 and v5.10. Could you please confirm that Cachy has no issues on both v5.9 and v5.10?
Thank you
Hi @hamadmarri
I can confirm that there is no difference on my machine with Cachy scheduler between 5.9 and 5.10 kernels.
Hi @Salekin-1169
I just found that the interactivity score fix should already be merged to xanmod/linux@16e99a8c42f36bc83be0af522e26d16e1cd64e98 and therefore 5.10.4-xanmod1-cacule should include the fix. You don't need to build your own custom kernel now.
@hamadmarri @raykzhao sorry about the delayed reply. I tested the latest patch, and it fixed my issue completely. Also, I tested on a freshly reinstalled system, so all the values are set to default.
I only faced some minor audio lags (very minimal) during recording, but other than that, everything was butter smooth.
Thank you so much for your support, the issue is completely fixed for me :bow: Wish you all a very Happy New Year :beers:
Glad to hear it is working good now. Happy new year @Salekin-1169 .
I learned so much about how ULE works from this article and slide (Can't find an appropriate place to share, so posting here temporarily).
ULE by design will cause starvation, but will yield better throughput it seems? I had completely opposite idea about it.
Hi @Salekin-1169
The research paper: https://www.usenix.org/system/files/conference/atc18/atc18-bouron.pdf Section: 3 Porting ULE to the Linux kernel
Unfortunately, this study is not fair: it is an implementation of ULE on top of CFS. They replaced some functions in CFS with their own ULE-style implementations. I don't call this a fair comparison. Their conclusion that ULE could lead to starvation is based on the stats and results from this unfair implementation.
Yes, I think CFS is more advanced than ULE, but I am not really convinced that ULE leads to starvation based on a single unfair study. I have tried FreeBSD; it is smoother, and I haven't faced any kind of starvation while stressing the system with many kinds of tests. ULE code is cleaner, 10x smaller than CFS, and probably faster; however, magically, CFS provides slightly higher throughput than ULE.
Thank you
EDIT: You can see Table1 on the paper of implemented functions. I don't think they have implemented ULE balancer too.
@hamadmarri thank you so much for your explanation. I'm still learning about it. CFS has about a decade of optimizations behind it, so I guess that is why its throughput is higher?
Just curious, how different is Cacule from the FreeBSD implementation of ULE? Can I read up about it somewhere?
CFS is a result of previous crazy approaches. One of them was the scheduler made by the genius guy called Con Kolivas (the author of bfs and muqss) - his scheduler then was Staircase scheduler. Ingo Molnár made CFS which is inspired by Kolivas's work. FreeBSD didn't have a crazy guy like Kolivas.
https://github.com/hamadmarri/cacule-cpu-scheduler#the-cacule-interactivity-score
I have only implemented the interactivity (IS) score (see Figure 1 https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/helper%20docs%20for%20kernel%20dev/FreeBSD/ULE.pdf)
My implementation of IS is not similar to ULE's; it is more like the Cachy/HRRN way. ULE uses 2 runqueues, and based on the IS math, the task is placed on one of them. They also use multilevel queues for priority, whereas I just use CFS's vruntime calculations, in which the task priority affects the vruntime value. That in turn affects the total run time of the process when calculating HRRN or IS.
I used a shortcut approach to adapt IS to CFS. ULE is totally different from CFS. I remember that ULE is only about 3k LOC, whereas CFS is definitely more than 25k LOC (I counted only 4 files in CFS, not all).
Here are the LOC of just the 4 Linux files I deal with daily:
❯ cat kernel/sched/fair.c | wc -l
11872
~/dev/linux/linux rdb
❯ cat kernel/sched/core.c | wc -l
8498
~/dev/linux/linux rdb
❯ cat kernel/sched/sched.h | wc -l
2643
~/dev/linux/linux rdb
❯ cat include/linux/sched.h | wc -l
2075
Thank you
Please let me know if the new cachy-r9 (with rdb balancer) is better than both cachy-r8/cacule
https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/patches/Cachy/v5.9/cachy-5.9-r9.patch
I can confirm that setting kernel.sched_interactivity_factor = 50 fixed this issue.
Hello! Not sure if it was a cacule problem. I am using 5.10.1 xanmod with cacule. I had IntelliJ IDEA running with a Java project, Firefox with hundreds of tabs, and YouTube playing. All RAM and swap were occupied, but it was all great until it wasn't. Suddenly everything just froze. earlyoom did not work, and even the magic key did not respond. Can I do something to figure out whether it was a cacule problem or not? Thank you.