Closed PeronGH closed 6 months ago
That's interesting. Can you try to have an strace
running on that process and have its output redirected to a file, then reproduce the problem with the high CPU usage, and then kill the process and see if you get a proper strace that way ?
That's interesting. Can you try to have an
strace
running on that process and have its output redirected to a file, then reproduce the problem with the high CPU usage, and then kill the process and see if you get a proper strace that way ?
I haven't found a way to reproduce the issue reliablely. It is too random. And I just encoutered it again while using VS Code. Some processes of VS Code stopped responding and took 100% CPU. I tried running strace
with stdout redirected to a file, the strace
was not responsive again, and I could not quit it using control + c
. Then I tried starting another shell tab, but the shell didn't start too. So I can only run wsl --shutdown
, however, this has took minutes and still unable to finish. the VmmemWSL
is still taking 100% of a few CPU cores and I guess my only choice is to restart my PC.
After rebooting and entering WSL again, I didn't find the file to be redirected to. I will see what else info I can collect next time I encounter it.
I have been encountering something similar when running bitbake, if it runs long enough, eventually WSL grinds to a halt, consuming all available CPUs given to it, and nothing actually runs. wsl --shutdown is the only way I've found to recover, once it's begun, as there's no way to run any diagnostics inside WSL at that point.
After some investigation, I find that any process that attaches to the process not responsive will be stuck, and can't be quitted using conrol + c
. But I can still quit it using kill
command.
I'm looking for some methods that can collect diagnostic logs without attaching to the process.
@PeronGH do you see RCU stalls in your kernel logs / dmesg when this happens?
@PeronGH do you see RCU stalls in your kernel logs / dmesg when this happens?
Yes, I do. This happened when I wake up the system from sleep:
Mar 09 17:33:53 Peron-Book kernel: ____sys_recvmsg+0x74/0x188
Mar 09 17:33:53 Peron-Book kernel: ___sys_recvmsg+0x7c/0xe8
Mar 09 17:33:53 Peron-Book kernel: __sys_recvmsg+0x58/0xc8
Mar 09 17:33:53 Peron-Book kernel: __arm64_sys_recvmsg+0x28/0x50
Mar 09 17:33:53 Peron-Book kernel: invoke_syscall.constprop.0+0x54/0x128
Mar 09 17:33:53 Peron-Book kernel: do_el0_svc+0xcc/0xf0
Mar 09 17:33:53 Peron-Book kernel: el0_svc+0x24/0xb0
Mar 09 17:33:53 Peron-Book kernel: el0t_64_sync_handler+0x138/0x148
Mar 09 17:33:53 Peron-Book kernel: el0t_64_sync+0x14c/0x150
Mar 09 17:33:53 Peron-Book kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 09 17:33:53 Peron-Book kernel: rcu: 7-....: (239894 ticks this GP) idle=5904/1/0x4000000000000000 softirq=4723/4723 fqs=115724
Mar 09 17:33:53 Peron-Book kernel: rcu: (t=240034 jiffies g=12009 q=734 ncpus=8)
Mar 09 17:33:53 Peron-Book kernel: CPU: 7 PID: 1190 Comm: systemd-journal Not tainted 6.7.9-WSL2-STABLE+ #1
Mar 09 17:33:53 Peron-Book kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 09 17:33:53 Peron-Book kernel: pc : __arch_copy_to_user+0x104/0x240
Mar 09 17:33:53 Peron-Book kernel: lr : _copy_to_iter+0xe8/0x558
Mar 09 17:33:53 Peron-Book kernel: sp : ffff80008dfa3910
Mar 09 17:33:53 Peron-Book kernel: x29: ffff80008dfa3910 x28: ffff0001b8cc5880 x27: 0000000000000065
Mar 09 17:33:53 Peron-Book kernel: x26: 0000000000006007 x25: ffff000002246800 x24: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x26: 0000000000006007 x25: ffff000002246800 x24: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x23: 0000000000000065 x22: ffff000000f1f300 x21: ffff80008dfa3dc0
Mar 09 17:33:53 Peron-Book kernel: x20: ffff000002246800 x19: 0000000000000065 x18: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x17: 0000000000000000 x16: 0000000000000000 x15: ffff000002246800
Mar 09 17:33:53 Peron-Book kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008032c618
Mar 09 17:33:53 Peron-Book kernel: x8 : 37313a3731203920 x7 : 2072614d3e36383c x6 : 0000ab81c89cc210
Mar 09 17:33:53 Peron-Book kernel: x5 : 0000ab81c89cc275 x4 : 0000000000000000 x3 : 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x2 : ffffffffffffffe5 x1 : ffff000002246810 x0 : 0000ab81c89cc210
Mar 09 17:33:53 Peron-Book kernel: Call trace:
Mar 09 17:33:53 Peron-Book kernel: __arch_copy_to_user+0x104/0x240
Mar 09 17:33:53 Peron-Book kernel: simple_copy_to_iter+0x48/0x98
Mar 09 17:33:53 Peron-Book kernel: __skb_datagram_iter+0x7c/0x280
Mar 09 17:33:53 Peron-Book kernel: skb_copy_datagram_iter+0x48/0xc8
Mar 09 17:33:53 Peron-Book kernel: __unix_dgram_recvmsg+0x164/0x4f0
Mar 09 17:33:53 Peron-Book kernel: unix_dgram_recvmsg+0x5c/0x88
Mar 09 17:33:53 Peron-Book kernel: sock_recvmsg+0x88/0x108
Mar 09 17:33:53 Peron-Book kernel: ____sys_recvmsg+0x74/0x188
Mar 09 17:33:53 Peron-Book kernel: ___sys_recvmsg+0x7c/0xe8
Mar 09 17:33:53 Peron-Book kernel: __sys_recvmsg+0x58/0xc8
Mar 09 17:33:53 Peron-Book kernel: __arm64_sys_recvmsg+0x28/0x50
Mar 09 17:33:53 Peron-Book kernel: invoke_syscall.constprop.0+0x54/0x128
Mar 09 17:33:53 Peron-Book kernel: do_el0_svc+0xcc/0xf0
Mar 09 17:33:53 Peron-Book kernel: el0_svc+0x24/0xb0
Mar 09 17:33:53 Peron-Book kernel: el0t_64_sync_handler+0x138/0x148
Mar 09 17:33:53 Peron-Book kernel: el0t_64_sync+0x14c/0x150
Mar 09 17:33:53 Peron-Book systemd-journald[444]: Collecting audit messages is disabled.
Mar 09 17:33:53 Peron-Book systemd-journald[444]: Journal started
Mar 09 17:33:53 Peron-Book systemd-journald[444]: System Journal (/var/log/journal/afbad49b043d473bb637af651af5b1dc) is 490.5M, max 4.0G, 3.5G free.
Mar 09 17:17:01 Peron-Book CRON[440]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 09 17:19:51 Peron-Book systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
Mar 09 17:17:01 Peron-Book CRON[439]: pam_unix(cron:session): session closed for user root
Mar 09 17:19:51 Peron-Book systemd[1]: systemd-journald.service: Killing process 36 (systemd-journal) with signal SIGABRT.
Mar 09 17:33:53 Peron-Book systemd[1]: apt-daily.service: Deactivated successfully.
Mar 09 17:33:53 Peron-Book systemd[1]: Finished apt-daily.service - Daily apt download activities.
And this is when compiling an Android App:
Mar 09 17:48:40 Peron-Book kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 09 17:48:40 Peron-Book kernel: rcu: 6-....: (14997 ticks this GP) idle=70e4/1/0x4000000000000000 softirq=11943/11943 fqs=7270
Mar 09 17:48:40 Peron-Book kernel: rcu: (t=15000 jiffies g=31397 q=994318 ncpus=8)
Mar 09 17:48:40 Peron-Book kernel: CPU: 6 PID: 3354 Comm: Monitor Deflati Not tainted 6.7.9-WSL2-STABLE+ #1
Mar 09 17:48:40 Peron-Book kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 09 17:48:40 Peron-Book kernel: pc : clear_rseq_cs.isra.0+0x20/0x38
Mar 09 17:48:40 Peron-Book kernel: lr : __rseq_handle_notify_resume+0x16c/0x490
Mar 09 17:48:40 Peron-Book kernel: sp : ffff80008b9abc80
Mar 09 17:48:40 Peron-Book kernel: x29: ffff80008b9abc80 x28: 000000000000080c x27: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x23: 0000ff7fc087dbb8 x22: 0000000000000000 x21: 00000000ffffffff
Mar 09 17:48:40 Peron-Book kernel: x20: ffff80008b9abeb0 x19: ffff000040f1d880 x18: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000ff0b6520c790
Mar 09 17:48:40 Peron-Book kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008025728c
Mar 09 17:48:40 Peron-Book kernel: x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
Mar 09 17:48:40 Peron-Book kernel: x2 : 0000000000000000 x1 : 0000ff7f9520d8c8 x0 : 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: Call trace:
Mar 09 17:48:40 Peron-Book kernel: clear_rseq_cs.isra.0+0x20/0x38
Mar 09 17:48:40 Peron-Book kernel: do_notify_resume+0x1cc/0xe68
Mar 09 17:48:40 Peron-Book kernel: el0_svc+0x90/0xb0
Mar 09 17:48:40 Peron-Book kernel: el0t_64_sync_handler+0x138/0x148
Mar 09 17:48:40 Peron-Book kernel: el0t_64_sync+0x14c/0x150
The kernel is https://github.com/Nevuly/WSL2-Linux-Kernel-Rolling/releases/tag/linux-wsl-stable-6.7.9 but the issue happened on the official WSL kernel too. If required, I can switch back to official kernel and wait for the issue to happen in order to collect logs.
Closed in favour of #11274
Alright, well unfortunately, welcome to the club. You can try compiling the kernel with CONFIG_RSEQ disabled which seriously alleviates the problem, and on a Hyper-V VM this doesn't happen (however you'll miss all the WSL2 magic).
I see (while typing this reply) that you found my latest issue - hopefully the Microsoft team will respond there soon!
Windows Version
Microsoft Windows [Version 10.0.22635.3209]
WSL Version
2.1.3.0
Are you using WSL 1 or WSL 2?
Kernel Version
5.15.146.1-2
Distro Version
Debian GNU/Linux trixie trixie/sid aarch64
Other Software
htop 3.3.0 GNU bash, version 5.2.21(1)-release (aarch64-unknown-linux-gnu) vscode 1.86.2 (Actually not specifc to any of them)
Repro Steps
The trigger is not clear, but it seems related to a process using too much resources or system sleep. It is triggered multilpe times a day, sometimes when compiling some Rust projects, somethimes when closing VS Code. It seems not related to any specific setup, as I encoutered it in the fresh install of different distros, kernels and configs.
Expected Behavior
This should not happen.
Actual Behavior
The process will suddenly become not responsive, and uses 100% CPU. It is also impossible to kill the process, the only way to stop it is to shutdown WSL. I've already tried different distros, but the bug persists on both Ubuntu and Debian. I also tried different branches of Debian, including the
testing
branch I'm using now. And I used different shells too,zsh
is also unable to avoid the bug. I even tried different kernels, including Nevuly/WSL2-Linux-Kernel-Rolling-LTS, but it cannot resolve the issue too. I also disbaled WSLg, swap, systemd and everything suspicious, still failed to solve the issue.Here is the screenshot of this time:
Here are previous screenshots:
(Caused by a process in WSLg container)
Diagnostic Logs
I tried attaching to it using
strace
, but thestrace
became not responsive too. I also triedlsof
, which also became not responsive and unable to be killed.