microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License
17.26k stars 812 forks source link

Frequent RCU Stalls on ARM64 #11210

Closed PeronGH closed 6 months ago

PeronGH commented 6 months ago

Windows Version

Microsoft Windows [Version 10.0.22635.3209]

WSL Version

2.1.3.0

Are you using WSL 1 or WSL 2?

Kernel Version

5.15.146.1-2

Distro Version

Debian GNU/Linux trixie trixie/sid aarch64

Other Software

htop 3.3.0 GNU bash, version 5.2.21(1)-release (aarch64-unknown-linux-gnu) vscode 1.86.2 (Actually not specifc to any of them)

Repro Steps

The trigger is not clear, but it seems related to a process using too much resources or system sleep. It is triggered multilpe times a day, sometimes when compiling some Rust projects, somethimes when closing VS Code. It seems not related to any specific setup, as I encoutered it in the fresh install of different distros, kernels and configs.

Expected Behavior

This should not happen.

Actual Behavior

The process will suddenly become not responsive, and uses 100% CPU. It is also impossible to kill the process, the only way to stop it is to shutdown WSL. I've already tried different distros, but the bug persists on both Ubuntu and Debian. I also tried different branches of Debian, including the testing branch I'm using now. And I used different shells too, zsh is also unable to avoid the bug. I even tried different kernels, including Nevuly/WSL2-Linux-Kernel-Rolling-LTS, but it cannot resolve the issue too. I also disbaled WSLg, swap, systemd and everything suspicious, still failed to solve the issue.

Here is the screenshot of this time:

image

Here are previous screenshots:

image (Caused by a process in WSLg container)

image

Diagnostic Logs

I tried attaching to it using strace, but the strace became not responsive too. I also tried lsof, which also became not responsive and unable to be killed.

OneBlue commented 6 months ago

That's interesting. Can you try to have an strace running on that process and have its output redirected to a file, then reproduce the problem with the high CPU usage, and then kill the process and see if you get a proper strace that way ?

PeronGH commented 6 months ago

That's interesting. Can you try to have an strace running on that process and have its output redirected to a file, then reproduce the problem with the high CPU usage, and then kill the process and see if you get a proper strace that way ?

I haven't found a way to reproduce the issue reliablely. It is too random. And I just encoutered it again while using VS Code. Some processes of VS Code stopped responding and took 100% CPU. I tried running strace with stdout redirected to a file, the strace was not responsive again, and I could not quit it using control + c. Then I tried starting another shell tab, but the shell didn't start too. So I can only run wsl --shutdown, however, this has took minutes and still unable to finish. the VmmemWSL is still taking 100% of a few CPU cores and I guess my only choice is to restart my PC.

PeronGH commented 6 months ago

After rebooting and entering WSL again, I didn't find the file to be redirected to. I will see what else info I can collect next time I encounter it.

ericblade commented 6 months ago

I have been encountering something similar when running bitbake, if it runs long enough, eventually WSL grinds to a halt, consuming all available CPUs given to it, and nothing actually runs. wsl --shutdown is the only way I've found to recover, once it's begun, as there's no way to run any diagnostics inside WSL at that point.

PeronGH commented 6 months ago

After some investigation, I find that any process that attaches to the process not responsive will be stuck, and can't be quitted using conrol + c. But I can still quit it using kill command.

I'm looking for some methods that can collect diagnostic logs without attaching to the process.

maxboone commented 6 months ago

@PeronGH do you see RCU stalls in your kernel logs / dmesg when this happens?

PeronGH commented 6 months ago

@PeronGH do you see RCU stalls in your kernel logs / dmesg when this happens?

Yes, I do. This happened when I wake up the system from sleep:

Mar 09 17:33:53 Peron-Book kernel:  ____sys_recvmsg+0x74/0x188
Mar 09 17:33:53 Peron-Book kernel:  ___sys_recvmsg+0x7c/0xe8
Mar 09 17:33:53 Peron-Book kernel:  __sys_recvmsg+0x58/0xc8
Mar 09 17:33:53 Peron-Book kernel:  __arm64_sys_recvmsg+0x28/0x50
Mar 09 17:33:53 Peron-Book kernel:  invoke_syscall.constprop.0+0x54/0x128
Mar 09 17:33:53 Peron-Book kernel:  do_el0_svc+0xcc/0xf0
Mar 09 17:33:53 Peron-Book kernel:  el0_svc+0x24/0xb0
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync_handler+0x138/0x148
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync+0x14c/0x150
Mar 09 17:33:53 Peron-Book kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 09 17:33:53 Peron-Book kernel: rcu:         7-....: (239894 ticks this GP) idle=5904/1/0x4000000000000000 softirq=4723/4723 fqs=115724
Mar 09 17:33:53 Peron-Book kernel: rcu:         (t=240034 jiffies g=12009 q=734 ncpus=8)
Mar 09 17:33:53 Peron-Book kernel: CPU: 7 PID: 1190 Comm: systemd-journal Not tainted 6.7.9-WSL2-STABLE+ #1
Mar 09 17:33:53 Peron-Book kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 09 17:33:53 Peron-Book kernel: pc : __arch_copy_to_user+0x104/0x240
Mar 09 17:33:53 Peron-Book kernel: lr : _copy_to_iter+0xe8/0x558
Mar 09 17:33:53 Peron-Book kernel: sp : ffff80008dfa3910
Mar 09 17:33:53 Peron-Book kernel: x29: ffff80008dfa3910 x28: ffff0001b8cc5880 x27: 0000000000000065
Mar 09 17:33:53 Peron-Book kernel: x26: 0000000000006007 x25: ffff000002246800 x24: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x26: 0000000000006007 x25: ffff000002246800 x24: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x23: 0000000000000065 x22: ffff000000f1f300 x21: ffff80008dfa3dc0
Mar 09 17:33:53 Peron-Book kernel: x20: ffff000002246800 x19: 0000000000000065 x18: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x17: 0000000000000000 x16: 0000000000000000 x15: ffff000002246800
Mar 09 17:33:53 Peron-Book kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008032c618
Mar 09 17:33:53 Peron-Book kernel: x8 : 37313a3731203920 x7 : 2072614d3e36383c x6 : 0000ab81c89cc210
Mar 09 17:33:53 Peron-Book kernel: x5 : 0000ab81c89cc275 x4 : 0000000000000000 x3 : 0000000000000000
Mar 09 17:33:53 Peron-Book kernel: x2 : ffffffffffffffe5 x1 : ffff000002246810 x0 : 0000ab81c89cc210
Mar 09 17:33:53 Peron-Book kernel: Call trace:
Mar 09 17:33:53 Peron-Book kernel:  __arch_copy_to_user+0x104/0x240
Mar 09 17:33:53 Peron-Book kernel:  simple_copy_to_iter+0x48/0x98
Mar 09 17:33:53 Peron-Book kernel:  __skb_datagram_iter+0x7c/0x280
Mar 09 17:33:53 Peron-Book kernel:  skb_copy_datagram_iter+0x48/0xc8
Mar 09 17:33:53 Peron-Book kernel:  __unix_dgram_recvmsg+0x164/0x4f0
Mar 09 17:33:53 Peron-Book kernel:  unix_dgram_recvmsg+0x5c/0x88
Mar 09 17:33:53 Peron-Book kernel:  sock_recvmsg+0x88/0x108
Mar 09 17:33:53 Peron-Book kernel:  ____sys_recvmsg+0x74/0x188
Mar 09 17:33:53 Peron-Book kernel:  ___sys_recvmsg+0x7c/0xe8
Mar 09 17:33:53 Peron-Book kernel:  __sys_recvmsg+0x58/0xc8
Mar 09 17:33:53 Peron-Book kernel:  __arm64_sys_recvmsg+0x28/0x50
Mar 09 17:33:53 Peron-Book kernel:  invoke_syscall.constprop.0+0x54/0x128
Mar 09 17:33:53 Peron-Book kernel:  do_el0_svc+0xcc/0xf0
Mar 09 17:33:53 Peron-Book kernel:  el0_svc+0x24/0xb0
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync_handler+0x138/0x148
Mar 09 17:33:53 Peron-Book kernel:  el0t_64_sync+0x14c/0x150
Mar 09 17:33:53 Peron-Book systemd-journald[444]: Collecting audit messages is disabled.
Mar 09 17:33:53 Peron-Book systemd-journald[444]: Journal started
Mar 09 17:33:53 Peron-Book systemd-journald[444]: System Journal (/var/log/journal/afbad49b043d473bb637af651af5b1dc) is 490.5M, max 4.0G, 3.5G free.
Mar 09 17:17:01 Peron-Book CRON[440]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Mar 09 17:19:51 Peron-Book systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!
Mar 09 17:17:01 Peron-Book CRON[439]: pam_unix(cron:session): session closed for user root
Mar 09 17:19:51 Peron-Book systemd[1]: systemd-journald.service: Killing process 36 (systemd-journal) with signal SIGABRT.
Mar 09 17:33:53 Peron-Book systemd[1]: apt-daily.service: Deactivated successfully.
Mar 09 17:33:53 Peron-Book systemd[1]: Finished apt-daily.service - Daily apt download activities.

And this is when compiling an Android App:

Mar 09 17:48:40 Peron-Book kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Mar 09 17:48:40 Peron-Book kernel: rcu:         6-....: (14997 ticks this GP) idle=70e4/1/0x4000000000000000 softirq=11943/11943 fqs=7270
Mar 09 17:48:40 Peron-Book kernel: rcu:         (t=15000 jiffies g=31397 q=994318 ncpus=8)
Mar 09 17:48:40 Peron-Book kernel: CPU: 6 PID: 3354 Comm: Monitor Deflati Not tainted 6.7.9-WSL2-STABLE+ #1
Mar 09 17:48:40 Peron-Book kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Mar 09 17:48:40 Peron-Book kernel: pc : clear_rseq_cs.isra.0+0x20/0x38
Mar 09 17:48:40 Peron-Book kernel: lr : __rseq_handle_notify_resume+0x16c/0x490
Mar 09 17:48:40 Peron-Book kernel: sp : ffff80008b9abc80
Mar 09 17:48:40 Peron-Book kernel: x29: ffff80008b9abc80 x28: 000000000000080c x27: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x23: 0000ff7fc087dbb8 x22: 0000000000000000 x21: 00000000ffffffff
Mar 09 17:48:40 Peron-Book kernel: x20: ffff80008b9abeb0 x19: ffff000040f1d880 x18: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000ff0b6520c790
Mar 09 17:48:40 Peron-Book kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffff80008025728c
Mar 09 17:48:40 Peron-Book kernel: x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000fffffffffff8
Mar 09 17:48:40 Peron-Book kernel: x2 : 0000000000000000 x1 : 0000ff7f9520d8c8 x0 : 0000000000000000
Mar 09 17:48:40 Peron-Book kernel: Call trace:
Mar 09 17:48:40 Peron-Book kernel:  clear_rseq_cs.isra.0+0x20/0x38
Mar 09 17:48:40 Peron-Book kernel:  do_notify_resume+0x1cc/0xe68
Mar 09 17:48:40 Peron-Book kernel:  el0_svc+0x90/0xb0
Mar 09 17:48:40 Peron-Book kernel:  el0t_64_sync_handler+0x138/0x148
Mar 09 17:48:40 Peron-Book kernel:  el0t_64_sync+0x14c/0x150

The kernel is https://github.com/Nevuly/WSL2-Linux-Kernel-Rolling/releases/tag/linux-wsl-stable-6.7.9 but the issue happened on the official WSL kernel too. If required, I can switch back to official kernel and wait for the issue to happen in order to collect logs.

PeronGH commented 6 months ago

Closed in favour of #11274

maxboone commented 6 months ago

Alright, well unfortunately, welcome to the club. You can try compiling the kernel with CONFIG_RSEQ disabled which seriously alleviates the problem, and on a Hyper-V VM this doesn't happen (however you'll miss all the WSL2 magic).

I see (while typing this reply) that you found my latest issue - hopefully the Microsoft team will respond there soon!