sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0
871 stars 81 forks source link

scx_bpfland: random stalls #674

Open SoulHarsh007 opened 4 weeks ago

SoulHarsh007 commented 4 weeks ago

:wave: I am using scx-scheds-git from CachyOS V3 repository. The build version is: 1.0.4.r74.ga3cc4c2-1

Description: After boot, in the first 10–12 minutes, scx_bpfland stalls and causes the session to freeze, but I am able to trigger a power off by pressing the power off button (not a hard reset, just pressing the power button triggers the shutdown sequence). The test machine has an AMD Ryzen 5500H (6C/12T).

Relevant logs:

Sep 23 13:02:57 NotXonoticP1 kernel: sched_ext: BPF scheduler "bpfland" disabled (runnable task stall)
Sep 23 13:02:57 NotXonoticP1 kernel: sched_ext: bpfland: kworker/10:1[105] failed to run for 7.040s
Sep 23 13:02:57 NotXonoticP1 kernel:    scx_watchdog_workfn+0x18d/0x220
Sep 23 13:02:57 NotXonoticP1 kernel:    process_one_work+0x192/0x350
Sep 23 13:02:57 NotXonoticP1 kernel:    worker_thread+0x24c/0x380
Sep 23 13:02:57 NotXonoticP1 kernel:    kthread+0xd2/0x100
Sep 23 13:02:57 NotXonoticP1 kernel:    ret_from_fork+0x34/0x50
Sep 23 13:02:57 NotXonoticP1 kernel:    ret_from_fork_asm+0x1a/0x30
Sep 23 13:02:58 NotXonoticP1 bash[1052]: DEBUG DUMP
Sep 23 13:02:58 NotXonoticP1 bash[1052]: ================================================================================
Sep 23 13:02:58 NotXonoticP1 bash[1052]: kworker/u48:11[32974] triggered exit kind 1026:
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   runnable task stall (kworker/10:1[105] failed to run for 7.040s)
Sep 23 13:02:58 NotXonoticP1 bash[1052]: Backtrace:
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   scx_watchdog_workfn+0x18d/0x220
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   process_one_work+0x192/0x350
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   worker_thread+0x24c/0x380
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   kthread+0xd2/0x100
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   ret_from_fork+0x34/0x50
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   ret_from_fork_asm+0x1a/0x30
Sep 23 13:02:58 NotXonoticP1 bash[1052]: CPU states
Sep 23 13:02:58 NotXonoticP1 bash[1052]: ----------
Sep 23 13:02:58 NotXonoticP1 bash[1052]: CPU 8   : nr_run=1 flags=0x1 cpu_rel=0 ops_qseq=19999 pnt_seq=239865
Sep 23 13:02:58 NotXonoticP1 bash[1052]:           curr=kworker/u48:11[32974] class=ext_sched_class
Sep 23 13:02:58 NotXonoticP1 bash[1052]:  *R kworker/u48:11[32974] +0ms
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       scx_state/flags=3/0xd dsq_flags=0x0 ops_state/qseq=0/0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       sticky/holding_cpu=-1/-1 dsq_id=(n/a) dsq_vtime=368488863894
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       cpus=fff
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     scx_dump_state+0x70d/0x8d0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     scx_ops_error_irq_workfn+0x40/0x50
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     irq_work_run_list+0x53/0x90
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     irq_work_run+0x18/0x50
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     __sysvec_irq_work+0x1c/0xb0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     sysvec_irq_work+0x6c/0x90
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     asm_sysvec_irq_work+0x1a/0x20
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     scx_watchdog_workfn+0x1a6/0x220
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     process_one_work+0x192/0x350
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     worker_thread+0x24c/0x380
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     kthread+0xd2/0x100
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     ret_from_fork+0x34/0x50
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     ret_from_fork_asm+0x1a/0x30
Sep 23 13:02:58 NotXonoticP1 bash[1052]: CPU 10  : nr_run=3 flags=0x1 cpu_rel=0 ops_qseq=24696 pnt_seq=296208
Sep 23 13:02:58 NotXonoticP1 bash[1052]:           curr=sway[1796] class=rt_sched_class
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   R kworker/10:1[105] -7040ms
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       sticky/holding_cpu=-1/-1 dsq_id=0xd dsq_vtime=365139714289
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       cpus=400
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     kthread+0xd2/0x100
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     ret_from_fork+0x34/0x50
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     ret_from_fork_asm+0x1a/0x30
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   R kworker/10:1H[119] -1408ms
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       scx_state/flags=3/0x9 dsq_flags=0x1 ops_state/qseq=0/0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       sticky/holding_cpu=-1/-1 dsq_id=0xd dsq_vtime=367884478134
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       cpus=400
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     kthread+0xd2/0x100
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     ret_from_fork+0x34/0x50
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     ret_from_fork_asm+0x1a/0x30
Sep 23 13:02:58 NotXonoticP1 bash[1052]:   R Xwayland[2999] -1ms
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       scx_state/flags=3/0x9 dsq_flags=0x0 ops_state/qseq=0/0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=365090032150
Sep 23 13:02:58 NotXonoticP1 bash[1052]:       cpus=fff
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     sock_def_readable+0x42/0xc0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     unix_stream_sendmsg+0x26e/0x6f0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     sock_write_iter+0x199/0x1a0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     do_iter_readv_writev+0x130/0x240
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     vfs_writev+0x177/0x3c0
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     __x64_sys_writev+0xec/0x140
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     do_syscall_64+0x82/0x190
Sep 23 13:02:58 NotXonoticP1 bash[1052]:     entry_SYSCALL_64_after_hwframe+0x76/0x7e
Sep 23 13:02:58 NotXonoticP1 bash[1052]: ================================================================================
Sep 23 13:02:58 NotXonoticP1 bash[1052]: 07:32:58 [INFO] Unregister scx_bpfland scheduler
Sep 23 13:02:58 NotXonoticP1 bash[1052]: Error: EXIT: runnable task stall (kworker/10:1[105] failed to run for 7.040s)
Sep 23 13:02:58 NotXonoticP1 systemd[1]: scx.service: Main process exited, code=exited, status=1/FAILURE
Sep 23 13:02:58 NotXonoticP1 systemd[1]: scx.service: Failed with result 'exit-code'.
Sep 23 13:02:58 NotXonoticP1 systemd[1]: scx.service: Scheduled restart job, restart counter is at 1.
Sep 23 13:02:58 NotXonoticP1 systemd[1]: Started Start scx_scheduler.
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] scx_bpfland 1.0.4-ga3cc4c2-dirty x86_64-unknown-linux-gnu SMT on
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] primary CPU domain = 0xfff
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] cpufreq performance level: auto
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L2 cache ID 1: sibling CPUs: [2, 3]
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L2 cache ID 2: sibling CPUs: [4, 5]
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L2 cache ID 4: sibling CPUs: [8, 9]
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L2 cache ID 5: sibling CPUs: [10, 11]
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L2 cache ID 0: sibling CPUs: [0, 1]
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L2 cache ID 3: sibling CPUs: [6, 7]
Sep 23 13:02:58 NotXonoticP1 bash[42807]: 07:32:58 [INFO] L3 cache ID 0: sibling CPUs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Sep 23 13:02:58 NotXonoticP1 kernel: sched_ext: "bpfland" does not implement cgroup cpu.weight
Sep 23 13:02:58 NotXonoticP1 kernel: sched_ext: BPF scheduler "bpfland" enabled
Sep 23 13:03:01 NotXonoticP1 warp-svc[1057]: 2024-09-23T07:33:01.470Z DEBUG warp::warp_service::network_change: Routes changed:
Sep 23 13:03:01 NotXonoticP1 warp-svc[1057]: NewNeighbour; Destination: 2401:4900:53f8:af97:3892:2fff:fe8e:83ba;
Sep 23 13:03:11 NotXonoticP1 systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Sep 23 13:03:18 NotXonoticP1 systemd-logind[1054]: Power key pressed short.
Sep 23 13:03:18 NotXonoticP1 systemd-logind[1054]: Powering off...
Sep 23 13:03:18 NotXonoticP1 systemd-logind[1054]: System is powering down.
ptr1337 commented 4 weeks ago

Please test with https://github.com/sched-ext/scx/commit/1b44eca9d54d1fe02faa81722d5f105f0bd50b9a

arighi commented 4 weeks ago

Can you post the output of lscpu -e?

SoulHarsh007 commented 3 weeks ago

Apologies for the late response, here is the lscpu -e output:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 4280.0000 400.0000 1612.9460
  1    0      0    0 0:0:0:0          yes 4280.0000 400.0000  400.0000
  2    0      0    1 1:1:1:0          yes 4280.0000 400.0000 1676.8750
  3    0      0    1 1:1:1:0          yes 4280.0000 400.0000 2096.2109
  4    0      0    2 2:2:2:0          yes 4280.0000 400.0000 1675.5720
  5    0      0    2 2:2:2:0          yes 4280.0000 400.0000  400.0000
  6    0      0    3 3:3:3:0          yes 4280.0000 400.0000 2093.4351
  7    0      0    3 3:3:3:0          yes 4280.0000 400.0000 2096.2490
  8    0      0    4 4:4:4:0          yes 4280.0000 400.0000 1761.4611
  9    0      0    4 4:4:4:0          yes 4280.0000 400.0000 2089.0779
 10    0      0    5 5:5:5:0          yes 4280.0000 400.0000 1869.4690
 11    0      0    5 5:5:5:0          yes 4280.0000 400.0000 1674.8400

PS: I am yet to test with commit: 1b44eca