sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0

scx_rustland eventually times out for no explicable reason #89

Closed kode54 closed 9 months ago

kode54 commented 9 months ago

I've tried twice running rustland with linux-cachyos 6.7.0-4, and it times out within a few minutes to an hour of running. The first time, I ran it in a TTY with exec, so I had no output after it died. The second time, I ran it under tmux and captured that it stopped updating its status output for 3 seconds, then updated the status once more, immediately printed a WARN because the 5s watchdog had timed out, and was terminated.

arighi commented 9 months ago

@kode54 can you check in the stdout if the nr_page_faults counter is 0 or > 0?

Even with the custom allocator + locking all the memory, I think there are still cases where the user-space scheduler can page fault. And that's problematic, because if it faults we can have a deadlock condition (e.g., a kthread needs to run to resolve the page fault, but it can't run because the scheduler is waiting for the kthread to resolve the page fault).

I'm thinking, for example, of ksm, kcompactd, remapping huge pages... maybe other cases. If that's the case, I would say it sounds like a kernel bug: if we mlockall() all the memory in the task, I would assume the kernel never unmaps/remaps its pages, under any condition...

kode54 commented 9 months ago

I just replaced my 0.1.5 package with the -git package, which has the nr_page_faults counter. I'll report back here if/when it terminates again.

Edit: It hasn't paged out yet, but instead it's regularly pegging an entire core while a Plex transcoder is running in the background.

arighi commented 9 months ago

Oh ok, that's good news! The 0.1.5 version doesn't even have the custom allocator, so it is highly susceptible to page faults. The page fault issue was fixed (mitigated) by 9708a80, which is not in 0.1.5.

arighi commented 9 months ago

BTW, I'm also stress testing this potential workaround (https://github.com/arighi/scx/commit/c5a69eec3b24d715f5c3300cc84473abeea8902b). With it applied, the scheduler seems to survive page faults. We still need the custom allocator (9708a80), because we want to prevent page faults from happening as much as possible in the user-space scheduler (they can still introduce global system lags), but with this one, even if they do happen, at least the scheduler seems to survive.

kode54 commented 9 months ago

Also, is it normal for rustland to use up to a full core of processing time most of the time it's running, or really any time anything is using the processor elsewhere?

arighi commented 9 months ago

> Also, is it normal for rustland to use up to a full core of processing time most of the time it's running, or really any time anything is using the processor elsewhere?

When the system is mostly idle, the scheduler should also be mostly idle. Right now, for example, I only have my browser, email client, and IRC client open, and it's using 0.3-1% of one CPU:

 339714 root      20   0  139628  73944   4608 R   0.3   0.5   0:00.12 scx_rustland                                                      

If I start a game, a build, etc., it can go up a lot, almost using a full core. If your system is mostly idle and rustland is using a full core, then there's a bug...

But I'm planning to do some tracing and see if we can optimize things a bit, because I have the feeling that sometimes we still have unnecessary wakeups.

arighi commented 9 months ago

> Edit: It hasn't paged out yet, but instead it's regularly pegging an entire core while a Plex transcoder is running in the background.

Oh! And I totally missed the edit, sorry. How much CPU % is the Plex transcoder using? Is it using multiple CPUs?

arighi commented 9 months ago

Moreover, about the cpu usage, can you try to apply this patch and see if it makes any difference? https://github.com/arighi/scx/commit/7bf70170693c2bbd53304e3436938af39a018652

kode54 commented 9 months ago

It seems to go up a lot if system load average goes up significantly, even if it's not from pure CPU load. For instance, a lot of I/O from bcachefs on 7200 RPM drives. And this also causes my desktop compositor, Wayfire, to become quite stuttery. The stuttering goes away when the I/O goes back down. It also goes away if I terminate scx_rustland and let it revert to kernel scheduling.

arighi commented 9 months ago

ok, I'll do some tests with some I/O bound workloads. My guess is that with more I/O, tasks are releasing the CPU more often, so there's more work to do for the scheduler.

kode54 commented 9 months ago

Incidentally, the Plex process ends up using between 500% and 800%, possibly even 1200%, brute-force decoding videos to locate their intro and end-credits timings. It kicks in the moment a full season has been copied or muxed into place.

arighi commented 9 months ago

Moreover, to clarify about the CPU load: we should expect to see a higher load / CPU usage with rustland, in particular in the CPU % used by scx_rustland itself, because, unlike other schedulers, it does all the scheduling work in user space. The scheduling work that is usually done (transparently) by the kernel is now done in user space and accounted to a particular task (scx_rustland).

This should be especially noticeable when lots of tasks are running, or when certain tasks are acquiring and releasing CPUs very frequently (i.e., I/O-bound tasks): the scheduler has more work to do, therefore scx_rustland uses more CPU.

We can do some optimizations for sure, avoiding unnecessary wakeups of the scheduler, but those would mostly reduce the load when the system is idle. When lots of tasks are competing with each other, acquiring and releasing CPUs, the scheduler needs to do a lot of work, otherwise the system becomes sluggish and unresponsive.

But that is about system load / cpu usage. If the system becomes unresponsive under certain conditions, then it's a potential scheduling problem and we should focus on that.

kode54 commented 9 months ago

I can probably close this and open a different issue regarding I/O, because that seems to affect most, if not all, of the schedulers I've tried.