sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0

SCX has problems resuming from standby #342

Closed SaberJ2X closed 3 weeks ago

SaberJ2X commented 1 month ago

Greetings

I've been using CachyOS with a sched_ext kernel and running lavd on a Steam Deck (though this might apply to other systems). After resuming from standby I hit a problem that doesn't happen when I disable sched_ext (sudo systemctl stop scx).

I'm not really sure what to mention, but a thread seems to get stuck, which makes the device really unresponsive, even to inputs, and the audio de-syncs or gets garbled.

The steps to reproduce this problem are: boot into gamemode (the default behavior) and go into a game. Once I'm in game, I press standby, then after a few minutes resume play. Then I quit the game and try to open another game. It usually triggers this quickly, but sometimes it needs a second try (standby -> resume -> change game). If you have onscreen performance numbers enabled, you'll see one or more cores stuck on 100%.
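For anyone without the onscreen performance overlay, the "core stuck on 100%" symptom can also be checked from two snapshots of /proc/stat. The following is only a rough sketch: the helper names `parse_proc_stat` and `stuck_cpus` and the 99% threshold are made up for illustration, and it assumes the usual per-CPU line layout (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip non-CPU lines and the aggregate "cpu" line; keep cpu0, cpu1, ...
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal
        ticks = [int(v) for v in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        stats[fields[0]] = (total - idle, total)
    return stats


def stuck_cpus(before, after, threshold=0.99):
    """List CPUs whose busy fraction between two snapshots is >= threshold."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    stuck = []
    for cpu, (busy1, total1) in second.items():
        if cpu not in first:
            continue  # CPU was not present in the first snapshot
        busy0, total0 = first[cpu]
        elapsed = total1 - total0
        if elapsed > 0 and (busy1 - busy0) / elapsed >= threshold:
            stuck.append(cpu)
    return stuck
```

To use it, read /proc/stat twice a few seconds apart after the resume and pass both texts to `stuck_cpus`; any CPU it returns was essentially never idle in between.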

I've included pictures and a log: IMG20240607185822, IMG20240607193301, lavd3.log

ptr1337 commented 1 month ago

The same occurs when using rusty and going into sleep. https://paste.cachyos.org/p/4020138.log

Latest 6.9 release, NVIDIA 4070S, 7950X3D, scx release 0.1.10.

xuanruiqi commented 4 weeks ago

Having the same error here. Causes extreme lag and frequent freezes.

htejun commented 4 weeks ago

Can someone who can reproduce the problem run the following drgn script?

https://github.com/sched-ext/sched_ext/blob/sched_ext/tools/sched_ext/scx_show_state.py

It will look like this:

# drgn os/work/tools/sched_ext/scx_show_state.py
ops           : rusty
enabled       : 1
switching_all : 1
switched_all  : 1
enable_state  : enabled (2)
bypass_depth  : 0
nr_rejected   : 0
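
When comparing output from a misbehaving system against the sample above, it may help to parse the `key : value` lines and list only the fields that differ. This is a small sketch, not part of scx_show_state.py; the helper names `parse_show_state` and `diff_states` are hypothetical, and it assumes the output keeps the one-field-per-line layout shown above.

```python
def parse_show_state(output):
    """Parse 'key : value' lines into a dict, skipping the shell prompt line."""
    state = {}
    for line in output.splitlines():
        if line.lstrip().startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def diff_states(reference, observed):
    """Return {field: (reference_value, observed_value)} for differing fields."""
    keys = sorted(set(reference) | set(observed))
    return {
        key: (reference.get(key), observed.get(key))
        for key in keys
        if reference.get(key) != observed.get(key)
    }
```

Pasting both outputs through `parse_show_state` and then `diff_states` gives a short summary to post in the thread instead of two full dumps.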

multics69 commented 4 weeks ago

Thank you, @SaberJ2X, for reporting the problem.

There were two issues (at least that I found).

One is that the kernel does not properly run the ops.cpu_online() callback on resume. This problem should be fixed by this PR.

scx_lavd had one more issue: it incorrectly added the suspended time to the task's runtime, so the scheduler concluded that the system was extremely busy after the resume. It should be fixed by this PR.
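
To illustrate the second issue, here is a toy model of the accounting. It is not the scx_lavd code or the actual fix; the class `RuntimeTracker` and its method names are invented for this sketch, and it assumes timestamps come from a clock that keeps counting while the system is suspended.

```python
class RuntimeTracker:
    """Toy per-task runtime accounting that can exclude suspended time."""

    def __init__(self):
        self.started_at = None      # when the task last started running
        self.suspended_at = None    # when the system last went to sleep
        self.suspended_total = 0    # sleep time inside the current run interval
        self.runtime = 0            # accumulated runtime

    def on_running(self, now):
        self.started_at = now
        self.suspended_total = 0

    def on_suspend(self, now):
        self.suspended_at = now

    def on_resume(self, now):
        if self.suspended_at is not None:
            # Only count sleep that overlaps an interval where the task was running.
            if self.started_at is not None:
                self.suspended_total += now - self.suspended_at
            self.suspended_at = None

    def on_stopping(self, now, exclude_suspend=True):
        elapsed = now - self.started_at
        if exclude_suspend:
            elapsed -= self.suspended_total
        self.runtime += elapsed
        self.started_at = None
        return elapsed
```

With a task that starts at t=100, a suspend from t=150 to t=10150, and a stop at t=10200, the naive accounting (`exclude_suspend=False`) charges 10100 units to the task, while excluding the suspended span charges only 100. The inflated figure is the kind of value that would make a scheduler think the system is extremely busy right after resume.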

multics69 commented 3 weeks ago

Closing the issue after the kernel fix and the scx_lavd fix.