sched-ext / scx

sched_ext schedulers and tools

scx_rustland stalling tasks dispatched in the per-CPU DSQs #353

Closed by arighi 2 weeks ago

arighi commented 3 weeks ago

A user reported a stall condition with scx_rustland, where tasks dispatched to the per-CPU DSQs may trigger the sched_ext watchdog. The trace looks like the following:

Jun 13 05:27:08 SoulHarsh007 bash[1097]: DEBUG DUMP
Jun 13 05:27:08 SoulHarsh007 bash[1097]: ================================================================================
Jun 13 05:27:08 SoulHarsh007 bash[1097]: kworker/u64:0[186743] triggered exit kind 1026:
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   runnable task stall (panel-91-system[2686] failed to run for 5.761s)
Jun 13 05:27:08 SoulHarsh007 bash[1097]: Backtrace:
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   scx_watchdog_workfn+0x154/0x1e0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   process_one_work+0x19d/0x360
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   worker_thread+0x2fa/0x490
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   kthread+0xd2/0x100
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   ret_from_fork+0x34/0x50
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   ret_from_fork_asm+0x1a/0x30
Jun 13 05:27:08 SoulHarsh007 bash[1097]: CPU states
Jun 13 05:27:08 SoulHarsh007 bash[1097]: ----------
Jun 13 05:27:08 SoulHarsh007 bash[1097]: CPU 9   : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=23798810 pnt_seq=90222261
Jun 13 05:27:08 SoulHarsh007 bash[1097]:           curr=swapper/9[0] class=idle_sched_class
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   R panel-91-system[2686] -5761ms
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       sticky/holding_cpu=-1/-1 dsq_id=0x9 dsq_vtime=0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       cpus=ffff
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     __x64_sys_poll+0x58/0x180
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     do_syscall_64+0x82/0x160
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jun 13 05:27:08 SoulHarsh007 bash[1097]: CPU 11  : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=19746198 pnt_seq=63418051
Jun 13 05:27:08 SoulHarsh007 bash[1097]:           curr=swapper/11[0] class=idle_sched_class
Jun 13 05:27:08 SoulHarsh007 bash[1097]:   R scx_rustland[1097] +0ms
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=2/19746197
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       cpus=ffff
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     do_syscall_64+0x82/0x160
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jun 13 05:27:08 SoulHarsh007 bash[1097]: CPU 13  : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=19979467 pnt_seq=64167440
Jun 13 05:27:08 SoulHarsh007 bash[1097]:           curr=kworker/u64:0[186743] class=ext_sched_class
Jun 13 05:27:08 SoulHarsh007 bash[1097]:  *R kworker/u64:0[186743] +0ms
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:       cpus=ffff
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     scx_dump_state+0x6e9/0x8b0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     scx_ops_error_irq_workfn+0x40/0x50
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     irq_work_run_list+0x53/0x90
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     irq_work_run+0x18/0x50
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     __sysvec_irq_work+0x1c/0xb0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     sysvec_irq_work+0x6c/0x90
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     asm_sysvec_irq_work+0x1a/0x20
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     scx_watchdog_workfn+0x16d/0x1e0
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     process_one_work+0x19d/0x360
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     worker_thread+0x2fa/0x490
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     kthread+0xd2/0x100
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     ret_from_fork+0x34/0x50
Jun 13 05:27:08 SoulHarsh007 bash[1097]:     ret_from_fork_asm+0x1a/0x30
Jun 13 05:27:08 SoulHarsh007 bash[1097]: ================================================================================
Jun 13 05:27:08 SoulHarsh007 bash[1097]: 23:57:08 [INFO] Unregister RustLand scheduler
Jun 13 05:27:08 SoulHarsh007 bash[1097]: Error: EXIT: runnable task stall (panel-91-system[2686] failed to run for 5.761s)

Considering that scx_rustland doesn't manage tasks running in the SCHED_IDLE class, I'm wondering if we're missing some logic to reserve CPU bandwidth for the tasks running in the lower scheduling class.

Edit: tasks running in the SCHED_IDLE class are indeed managed by scx_rustland, so this definitely seems to be an scx_rustland issue.

arighi commented 3 weeks ago

Commit b36ae4fd82466237f3c5408051bf413b10a56022 seems to prevent (or at least mitigate) the stall condition.

This has been tested extensively by the original bug reporter (from the CachyOS sched_ext discord channel):

SoulHarsh007: ~9.2H testing rustland, with load average: 11.18, 11.23, 11.29, no stall log,
even idled for like 2 to 3 hours before stressing 

In my tests I haven't noticed any performance or responsiveness regression with the last patch applied.

@htejun do you see any potential problem/downside if we always kick the CPUs without __COMPAT_SCX_KICK_IDLE after dispatching?
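
For reference, the difference being discussed is roughly the following (a sketch only, assuming the kick happens right after the task is dispatched to the per-CPU DSQ; the real code is in rust/scx_rustland_core/assets/bpf/main.bpf.c):

/* With SCX_KICK_IDLE the kick is a no-op unless the target CPU is idle,
 * so a race in the idle detection could leave the DSQ unconsumed. */
scx_bpf_kick_cpu(cpu, __COMPAT_SCX_KICK_IDLE);

/* Kicking unconditionally always sends a resched IPI to the target CPU,
 * at the cost of possibly causing extra scheduling cycles when it's busy. */
scx_bpf_kick_cpu(cpu, 0);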

htejun commented 3 weeks ago

That should be fine, although it can lead to extra scheduling cycles. I wonder why removing KICK_IDLE fixes the issue though. It could be a bug either in the KICK_IDLE implementation or in rustland. Are the stalls only reproducible with SCHED_IDLE tasks?

arighi commented 3 weeks ago

@htejun they seem to happen only with SCHED_IDLE tasks, at least according to the traces I've seen (unfortunately I can't reproduce the problem locally; only a user from the CachyOS community is experiencing this issue).

But I'm not sure if the SCHED_IDLE aspect is just a coincidence. The task is dispatched correctly to a per-CPU DSQ (we can see it in the exit trace), it's just not consumed within the watchdog timeout. That is a bit odd, because in theory rustland_dispatch() consumes all the tasks queued in the corresponding per-CPU DSQ, so it seems that in some cases the CPU is not "kicked" after the dispatch happens.

arighi commented 3 weeks ago

This seems to be fixed by #361, closing for now.

arighi commented 2 weeks ago

I'm re-opening this one, because the stall issue is still happening.

I've updated the description and the title to better clarify the problem. First of all, it is not happening only with SCHED_IDLE tasks.

To focus on the real issue, this is the relevant part that appears in all the traces, for example:

  R plasmashel:gl0[2471] -6928ms
      scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x15 dsq_vtime=30950371873
      cpus=038038

As we can see, the task has been dispatched to a per-CPU DSQ (in this case the one associated with CPU 0x15), but it has been sitting there for almost 7s, triggering the sched_ext watchdog.

That is quite odd, because every time we dispatch a task to a per-CPU DSQ we immediately kick the CPU via scx_bpf_kick_cpu(), which should trigger the dispatch callback to consume the pending tasks from the corresponding per-CPU DSQ.
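
The dispatch-and-kick pattern described above looks conceptually like this (a simplified sketch, not the actual rustland code; cpu_to_dsq() is an assumed helper mapping a CPU to the per-CPU DSQ created at init time, and slice_ns comes from the user-space scheduler):

static void dispatch_task_on_cpu(struct task_struct *p, s32 cpu, u64 slice_ns)
{
	/* Queue the task on the DSQ associated with the CPU chosen by the
	 * user-space scheduler... */
	scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);

	/* ...and immediately wake that CPU, so that its .dispatch()
	 * callback runs and consumes the task. */
	scx_bpf_kick_cpu(cpu, 0);
}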

The issue is triggered even with this special branch https://github.com/sched-ext/scx/tree/rustland-stall, which (as a test) always uses scx_bpf_kick_cpu() without SCX_KICK_IDLE, to exclude any potential race in detecting the idle state of the CPU.

Moreover, that branch also tries to consume all the tasks dispatched to the per-CPU DSQ at once, just to make sure we're not leaving any task behind in the DSQ (as we can see here: https://github.com/sched-ext/scx/blob/rustland-stall/rust/scx_rustland_core/assets/bpf/main.bpf.c#L742).
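
Conceptually, the consume-all logic is the following (a sketch only; the linked main.bpf.c is the authoritative version, and MAX_ENQUEUED_TASKS is just an assumed upper bound to keep the verifier happy):

void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Drain every task sitting in this CPU's DSQ, not just the first
	 * one, so nothing is left behind waiting for another kick. */
	bpf_repeat(MAX_ENQUEUED_TASKS) {
		if (!scx_bpf_consume(cpu_to_dsq(cpu)))
			break;
	}
}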

In addition to all of the above, I even added a periodic timer action (running every 10ms) that explicitly kicks the CPUs that still have pending tasks in their DSQs, as we can see here: https://github.com/sched-ext/scx/blob/rustland-stall/rust/scx_rustland_core/assets/bpf/main.bpf.c#L958
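
The periodic safety net is conceptually the following (again a sketch of the idea rather than the code in the link; nr_cpu_ids is assumed to be a global holding the number of possible CPUs, and cpu_to_dsq() the same assumed helper as above):

static int kick_timer_fn(void *map, int *key, struct bpf_timer *timer)
{
	s32 cpu;

	/* Kick every CPU that still has tasks queued in its per-CPU DSQ. */
	bpf_for(cpu, 0, nr_cpu_ids) {
		if (scx_bpf_dsq_nr_queued(cpu_to_dsq(cpu)))
			scx_bpf_kick_cpu(cpu, 0);
	}

	/* Re-arm the timer to fire again in 10ms. */
	bpf_timer_start(timer, 10ULL * 1000ULL * 1000ULL, 0);

	return 0;
}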

Despite all these changes we are still getting the same stalls, and they all look like the one reported above. So I'm wondering if we're having some issue with scx_bpf_kick_cpu() itself.

@htejun @Byte-Lab am I missing something obvious here?

arighi commented 2 weeks ago

Nevermind... I was definitely missing something obvious: the assigned DSQ is not a valid one according to the task's cpumask. Definitely a bug in rustland! Cool, I'll fix that, sorry for the noise.
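
In other words, the user-space scheduler can hand back a CPU that is not in the task's affinity mask, so the task ends up parked on a per-CPU DSQ that is never consumed on its behalf. A defensive check on the BPF side would look roughly like this (just a sketch of the idea, not necessarily what the actual fix does; cpu_to_dsq() and SHARED_DSQ are assumed names):

static u64 task_target_dsq(struct task_struct *p, s32 cpu)
{
	/* Only use the per-CPU DSQ if the selected CPU is actually allowed
	 * by the task's affinity mask... */
	if (cpu >= 0 && bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
		return cpu_to_dsq(cpu);

	/* ...otherwise fall back to a DSQ that any CPU can consume from. */
	return SHARED_DSQ;
}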

arighi commented 2 weeks ago

This should be fixed (for real this time) by #380.