scx_rustland_core: performance regression due to kernel change

arighi commented 1 month ago

This commit in the kernel introduces a pretty bad performance regression in all the scx_rustland_core schedulers:

7c65ae81ea86 ("sched_ext: Don't call put_prev_task_scx() before picking the next task")

The system becomes completely unresponsive when it's saturated, and this is very easy to reproduce (e.g., by starting a parallel kernel build with scx_rustland active).

I think the reason is one (or both) of these behavior changes:

    This causes two behavior changes observable from the BPF scheduler:

    - When a task keep running, it no longer goes through enqueue/dequeue cycle
      and thus ops.stopping/running() transitions. The new behavior is better
      and all the existing schedulers should be able to handle the new behavior.

    - The BPF scheduler cannot keep executing the current task by enqueueing
      SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
      BPF scheduler is responsible for resuming execution after each
      SCX_ENQ_LAST.

But I haven't figured out exactly why. I've been experimenting with SCX_ENQ_LAST, without success, so I'm just opening the issue for now. Any pointers on how to attack this?

arighi commented 1 month ago

I think I found a much easier reproducer; see 7f9b009c9c772e04c9da614fc6056dc9a6c47f0d.

It seems that in 6.12, ops.update_idle() is occasionally not being called. scx_rustland_core depends on ops.update_idle() to trigger the wakeup of the user-space scheduler to handle pending tasks, so when the call is skipped, pending tasks wait longer than they should and performance suffers. This issue is likely related to the changes to pick_next_task() / put_prev_task() in the kernel.
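To illustrate why a skipped idle notification hurts, here is a minimal toy model in plain C. It is only a sketch: every name in it (`toy_sched`, `toy_update_idle`, `toy_simulate`) is hypothetical and none of it is the real sched_ext or scx_rustland_core code. It assumes one task is queued per round and that the user-space scheduler only runs when the idle hook fires.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are NOT the sched_ext or scx_rustland_core API. */
struct toy_sched {
    int pending;    /* tasks waiting for the user-space scheduler */
    int dispatched; /* tasks the user-space scheduler has handled */
};

/* Stand-in for one run of the user-space scheduler: drain all pending tasks. */
static void toy_run_user_sched(struct toy_sched *s)
{
    s->dispatched += s->pending;
    s->pending = 0;
}

/*
 * Stand-in for an update_idle-style hook: when a CPU goes idle and work is
 * pending, wake the user-space scheduler.
 */
static void toy_update_idle(struct toy_sched *s, bool idle)
{
    if (idle && s->pending > 0)
        toy_run_user_sched(s);
}

/*
 * Run `rounds` idle transitions, queuing one task before each. The idle hook
 * is skipped on every `skip_every`-th round (0 means it is never skipped).
 * Returns the number of rounds that ended with tasks still pending.
 */
static int toy_simulate(int rounds, int skip_every)
{
    struct toy_sched s = { 0, 0 };
    int stalled = 0;

    assert(rounds >= 0 && skip_every >= 0);

    for (int i = 1; i <= rounds; i++) {
        bool skipped = skip_every > 0 && i % skip_every == 0;

        s.pending++;
        if (!skipped)
            toy_update_idle(&s, true);
        if (s.pending > 0)
            stalled++;
    }
    return stalled;
}
```

In this model, `toy_simulate(8, 0)` never leaves work pending, while `toy_simulate(8, 4)` ends two rounds with a task stuck in the queue until the next idle transition picks it up. The real regression is more involved, but the dependency has the same shape.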

I don't have a fix yet; I'm just sharing the reproducer for now, and I'll investigate further on the kernel side.

arighi commented 1 month ago

FYI, https://lore.kernel.org/lkml/20241013173928.20738-1-andrea.righi@linux.dev/T/#u seems to fix this regression.

sched-ext/scx, issue #788