WSL2 intermittently slows/hangs on Windows on ARM

david-nordvall commented 8 months ago

Windows Version

10.0.22621.2428

WSL Version

1.2.5.0

Are you using WSL 1 or WSL 2?

[X] WSL 2
[ ] WSL 1

Kernel Version

5.15.90.1

Distro Version

Ubuntu 22.04

Other Software

Docker Engine (installed in WSL following the instructions for installing on Ubuntu). Not Docker Desktop.
VS Code with the Dev Containers extensions, for enabling development within a Docker container.
.NET 7 SDK, installed in the development Docker container started by VS Code's Dev Containers extension.
MongoDB running in Docker container in Ubuntu on WSL2.
OpenTelemetryCollector running in Docker container in Ubuntu on WSL2.

Repro Steps

Start a VS Code remote development environment with a development Docker container running on Docker Engine on Ubuntu on WSL2. Start working/compiling/debugging and observe that the WSL2 environment, including all Docker containers running on WSL2, will start to slow down/hang with high CPU usage after a short while (specifically the VmmemWSL process). The environment will completely hang for a number of seconds, then continue working for a while.

The problems not only affect the Docker environment but also "native WSL2" shells/environments.

The problems get increasingly worse (more frequent hangs which lasts for longer and longer time) until the WSL2 is completely unusable. Most of the time, running wsl --shutdown will shut WSL2 down, after which it can be restarted, and the slow/hang cycle can begin again. Note that running wsl --shutdown takes abnormally long to execute when this issue has been encountered (30s to 10 min).

More complex development environments (for example multiple code projects compiling in parallell, development databases and other software running in containers) will speed up the slow/hang cycle.

This issue is created per request in issue #9454.

Expected Behavior

The VS Code remote development environment will not slow down/start to hang over time and will continue using "normal" CPU cycles.

Actual Behavior

The VS Code remove development environment, and WSL2 as a whole, starts to slow down/hang with increasing frequency.

Diagnostic Logs

WSL dumps collected during a hang is available at https://doits-my.sharepoint.com/:u:/g/personal/david_nordvall_doitsolutions_se/EedqfOgnSnJOtLAKe7Jd2R0BqtdtieIthWXUbvLOPBf9zg?e=2N8fuq.

Please note that I got several instances of this error when the dumps were collected:

Line |
 118 |          $Result = $DumpMethod.Invoke($null, @($process.Handle,
     |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     | You cannot call a method on a null-valued expression.
Failed to write dump for: C:\Users\DavidNordvall\Downloads\wsl-dumps\WslLogs-2023-10-23_19-15-49\dumps/wsl.7204.dmp
Writing C:\Users\DavidNordvall\Downloads\wsl-dumps\WslLogs-2023-10-23_19-15-49\dumps/wsl.12444.dmp

benhillis commented 8 months ago

There is a Windows OS fix that targets an ARM-specific hang that should be going out next month (I think the fourth patch Tuesday of November).

david-nordvall commented 8 months ago

@benhillis That sounds hopeful! Does this fix specifically target hangs in WSL? I've had zero issues with hangs and slow downs outside of WSL.

benhillis commented 8 months ago

@david-nordvall - yes it is a hypervisor fix specific to running virtual machines.

hwine commented 8 months ago

@david-nordvall fyi, I also got the failure messages -- turns out the dump utility in the MS script doesn't work for ARM, but I could get valid dumps with procdump64a.exe as described over here

david-nordvall commented 7 months ago

@david-nordvall - yes it is a hypervisor fix specific to running virtual machines.

@benhillis Do you know if the fix has been released yet. I have installed all updates available for my Surface Pro 9 5G to date but the problem persists. I am currently running the following:

WSL-version: 2.0.9.0
Kernelversion: 5.15.133.1-1
WSLg-version: 1.0.59
MSRDC-version: 1.2.4677
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.2792

david-nordvall commented 6 months ago

@benhillis One more patch tuesday-patch applied and the issue still persists. Any word on if the hypervisor fix has actually been released? If it has, how can I help you get more data for debugging? This issue really needs to get fixed... :-(

Current WSL version info:

WSL-version: 2.0.9.0
Kernelversion: 5.15.133.1-1
WSLg-version: 1.0.59
MSRDC-version: 1.2.4677
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.2861

david-nordvall commented 6 months ago

Update: I completely wiped my Surface Pro 9 and re-installed Windows (using the restore functionality built in to Windows). The issue still persists :-(

WSL-version: 2.0.9.0
Kernelversion: 5.15.133.1-1
WSLg-version: 1.0.59
MSRDC-version: 1.2.4677
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.2861

@benhillis can you confirm or deny if the hypervisor fix you mentioned earlier has been released?

david-nordvall commented 6 months ago

Another patch Tuesday-patch applied and the issue still persists.

WSL-version: 2.0.9.0
Kernelversion: 5.15.133.1-1
WSLg-version: 1.0.59
MSRDC-version: 1.2.4677
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3007

@benhillis and @craigloewen-msft, I'm happy to provide you with any logs you may need to property debug and fix this issue.

david-nordvall commented 5 months ago

So, I realize Windows on ARM isn't Microsoft's biggest priority. Especially issues specific for developers using Windows on ARM, since there probably are very few such developers. But I'd still love to hear if the Hyper-V fix you mentioned, @benhillis, has been released yet?

Considering that there "soon" will be devices with Qualcomm's new Snapdragon X processor, which might appeal to more developers, and that Docker might release a Docker Desktop version for Windows on ARM during 2024, I'd hope Microsoft would pay a little more attention to this issue, since it pretty much makes Windows on ARM a non-starter as a developer platform, regardless of the processor's underlying performance and efficiency.

Another patch Tuesday-patch applied. The issue still persists:

WSL-version: 2.0.9.0
Kernelversion: 5.15.133.1-1
WSLg-version: 1.0.59
MSRDC-version: 1.2.4677
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3085

luozf14 commented 5 months ago

Having exactly the same issue. WSL hangs quickly after using VS code remote.

vadika commented 4 months ago

Thinkpad x13s gen1 -- looks like I found the solution, at least it works for me -- no hangs, no excessive cpu load and survives sleep/resume. Version WSL: 2.1.1.0 Version kernel: 5.15.146.1-2 Version WSLg: 1.0.60 Version MSRDC: 1.2.5105 Version Direct3D: 1.611.1-81528511 Version DXCore: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp Version Windows: 10.0.22631.3085

I upgraded WSL to pre-release ( wsl --update --pre-release) and disabled DirectX ( /Users/%username% .wslconfig --> [wsl2] gpuSupport=false ) Did it because there were kernel messages about dx interaction in linux kernel log. Restarted WSL and have no problems since (24 hours stable now)

david-nordvall commented 4 months ago

@vadika, that's great! Please keep me posted if the fix works more long term than 24 hours :-). Unfortunately it made no (or little) difference for me on my Surface Pro 9 5G.

WSL-version: 2.1.1.0
Kernelversion: 5.15.146.1-2
WSLg-version: 1.0.60
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3085

I can still make WSL hang and/or use excessive CPU by starting a VS Code dev conatiner with a couple of moderately complex .NET projects. I may be able to work a little longer before WSL starts to freeze, it is difficult to tell. Either way, using a pre-release version of WSL and disabling GPU support doesn't fix the issue for me.

david-nordvall commented 4 months ago

Some logs from dmesg -w. Don't know if this is useful info, @benhillis and @craigloewen-msft?

[  429.728472] rcu: INFO: rcu_sched self-detected stall on CPU
[  429.732010] rcu:     3-....: (14771 ticks this GP) idle=0d1/1/0x4000000000000002 softirq=21325/21325 fqs=6694
[  429.733644]  (t=15002 jiffies g=58457 q=34118)
[  429.733665] Task dump for CPU 3:
[  429.733666] task:.NET Debugger   state:R  running task     stack:    0 pid: 2812 ppid:  1198 flags:0x0000080f
[  429.733670] Call trace:
[  429.733671]  dump_backtrace+0x0/0x1c0
[  429.733762]  show_stack+0x1c/0x24
[  429.733764]  sched_show_task+0x164/0x190
[  429.733783]  dump_cpu_task+0x48/0x54
[  429.733833]  rcu_dump_cpu_stacks+0xec/0x130
[  429.733834]  rcu_sched_clock_irq+0x914/0xa50
[  429.733845]  update_process_times+0xa0/0x190
[  429.733855]  tick_sched_timer+0x5c/0xd0
[  429.733882]  __hrtimer_run_queues+0x140/0x330
[  429.733883]  hrtimer_interrupt+0xf4/0x240
[  429.733885]  hv_stimer0_isr+0x28/0x30
[  429.733927]  hv_stimer0_percpu_isr+0x14/0x20
[  429.733928]  handle_percpu_devid_irq+0x8c/0x1b4
[  429.733937]  handle_domain_irq+0x64/0x90
[  429.733947]  gic_handle_irq+0x58/0x128
[  429.733948]  call_on_irq_stack+0x20/0x38
[  429.733950]  do_interrupt_handler+0x54/0x5c
[  429.733951]  el1_interrupt+0x2c/0x4c
[  429.733962]  el1h_64_irq_handler+0x14/0x20
[  429.733963]  el1h_64_irq+0x74/0x78
[  429.733964]  mm_release+0xd8/0x140
[  429.733983]  exit_mm_release+0x2c/0x40
[  429.733984]  do_exit+0x1a0/0xa70
[  429.733993]  do_group_exit+0x3c/0xa4
[  429.733995]  get_signal+0x1e0/0x9b0
[  429.734005]  do_notify_resume+0x15c/0xebc
[  429.734007]  el0_svc+0x3c/0x4c
[  429.734008]  el0t_64_sync_handler+0x9c/0x120
[  429.734009]  el0t_64_sync+0x158/0x15c
[  451.744411] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 3-... 4-... } 15308 jiffies s: 501 root: 0x18/.
[  451.745440] rcu: blocking rcu_node structures (internal RCU debug):
[  451.745813] Task dump for CPU 3:
[  451.745815] task:.NET Debugger   state:R  running task     stack:    0 pid: 2812 ppid:  1198 flags:0x0000080f
[  451.745819] Call trace:
[  451.745820]  __switch_to+0xb4/0xec
[  451.745827]  el1h_64_irq+0x74/0x78
[  451.745829]  mm_release+0xd8/0x140
[  451.745857]  exit_mm_release+0x2c/0x40
[  451.745858]  do_exit+0x1a0/0xa70
[  451.745860]  do_group_exit+0x3c/0xa4
[  451.745861]  get_signal+0x1e0/0x9b0
[  451.745863]  do_notify_resume+0x15c/0xebc
[  451.745865]  el0_svc+0x3c/0x4c
[  451.745867]  el0t_64_sync_handler+0x9c/0x120
[  451.745869]  el0t_64_sync+0x158/0x15c
[  451.745870] Task dump for CPU 4:
[  451.745870] task:snapd           state:R  running task     stack:    0 pid:  651 ppid:   160 flags:0x0000000a
[  451.745873] Call trace:
[  451.745873]  __switch_to+0xb4/0xec
[  451.745875]  0xffff0000193cc880

> wsl --version
WSL-version: 2.1.1.0
Kernelversion: 5.15.146.1-2
WSLg-version: 1.0.60
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3085

david-nordvall commented 4 months ago

And just to show that it's not always the .NET Debugger task that's affected by CPU stalls (which happened to be the case in both stalls in the logs above), here are some more logs in which snapd is the affected tasks:

[  668.559419] rcu: INFO: rcu_sched self-detected stall on CPU
[  668.561475] rcu:     0-....: (14113 ticks this GP) idle=9f7/1/0x4000000000000002 softirq=21990/21990 fqs=6453
[  668.563573]  (t=15000 jiffies g=79477 q=24677)
[  668.563580] Task dump for CPU 0:
[  668.563583] task:snapd           state:R  running task     stack:    0 pid:  634 ppid:   157 flags:0x0000000a
[  668.563594] Call trace:
[  668.563596]  dump_backtrace+0x0/0x1c0
[  668.563609]  show_stack+0x1c/0x24
[  668.563616]  sched_show_task+0x164/0x190
[  668.563624]  dump_cpu_task+0x48/0x54
[  668.563630]  rcu_dump_cpu_stacks+0xec/0x130
[  668.563635]  rcu_sched_clock_irq+0x914/0xa50
[  668.563641]  update_process_times+0xa0/0x190
[  668.563648]  tick_sched_timer+0x5c/0xd0
[  668.563655]  __hrtimer_run_queues+0x140/0x330
[  668.563661]  hrtimer_interrupt+0xf4/0x240
[  668.563667]  hv_stimer0_isr+0x28/0x30
[  668.563673]  hv_stimer0_percpu_isr+0x14/0x20
[  668.563677]  handle_percpu_devid_irq+0x8c/0x1b4
[  668.563682]  handle_domain_irq+0x64/0x90
[  668.563689]  gic_handle_irq+0x58/0x128
[  668.563694]  call_on_irq_stack+0x20/0x38
[  668.563699]  do_interrupt_handler+0x54/0x5c
[  668.563720]  el1_interrupt+0x2c/0x4c
[  668.563726]  el1h_64_irq_handler+0x14/0x20
[  668.563731]  el1h_64_irq+0x74/0x78
[  668.563735]  clear_rseq_cs.isra.0+0x4c/0x60
[  668.563755]  do_notify_resume+0x100/0xebc
[  668.563762]  el0_svc+0x3c/0x4c
[  668.563766]  el0t_64_sync_handler+0x9c/0x120
[  668.563771]  el0t_64_sync+0x158/0x15c

And this is after the latest patch Tuesday pathches has been applied:

> wsl --version
WSL-version: 2.1.1.0
Kernelversion: 5.15.146.1-2
WSLg-version: 1.0.60
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3155

vadika commented 4 months ago

Can you tes

And just to show that it's not always the .NET Debugger task that's affected by CPU stalls (which happened to be the case in both stalls in the logs above), here are some more logs in which snapd is the affected tasks:

[  668.559419] rcu: INFO: rcu_sched self-detected stall on CPU
[  668.561475] rcu:     0-....: (14113 ticks this GP) idle=9f7/1/0x4000000000000002 softirq=21990/21990 fqs=6453
[  668.563573]  (t=15000 jiffies g=79477 q=24677)
[  668.563580] Task dump for CPU 0:
[  668.563583] task:snapd           state:R  running task     stack:    0 pid:  634 ppid:   157 flags:0x0000000a
[  668.563594] Call trace:
[  668.563596]  dump_backtrace+0x0/0x1c0
[  668.563609]  show_stack+0x1c/0x24
[  668.563616]  sched_show_task+0x164/0x190
[  668.563624]  dump_cpu_task+0x48/0x54
[  668.563630]  rcu_dump_cpu_stacks+0xec/0x130
[  668.563635]  rcu_sched_clock_irq+0x914/0xa50
[  668.563641]  update_process_times+0xa0/0x190
[  668.563648]  tick_sched_timer+0x5c/0xd0
[  668.563655]  __hrtimer_run_queues+0x140/0x330
[  668.563661]  hrtimer_interrupt+0xf4/0x240
[  668.563667]  hv_stimer0_isr+0x28/0x30
[  668.563673]  hv_stimer0_percpu_isr+0x14/0x20
[  668.563677]  handle_percpu_devid_irq+0x8c/0x1b4
[  668.563682]  handle_domain_irq+0x64/0x90
[  668.563689]  gic_handle_irq+0x58/0x128
[  668.563694]  call_on_irq_stack+0x20/0x38
[  668.563699]  do_interrupt_handler+0x54/0x5c
[  668.563720]  el1_interrupt+0x2c/0x4c
[  668.563726]  el1h_64_irq_handler+0x14/0x20
[  668.563731]  el1h_64_irq+0x74/0x78
[  668.563735]  clear_rseq_cs.isra.0+0x4c/0x60
[  668.563755]  do_notify_resume+0x100/0xebc
[  668.563762]  el0_svc+0x3c/0x4c
[  668.563766]  el0t_64_sync_handler+0x9c/0x120
[  668.563771]  el0t_64_sync+0x158/0x15c

And this is after the latest patch Tuesday pathches has been applied:

> wsl --version
WSL-version: 2.1.1.0
Kernelversion: 5.15.146.1-2
WSLg-version: 1.0.60
MSRDC-version: 1.2.5105
Direct3D-version: 1.611.1-81528511
DXCore-version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows-version: 10.0.22631.3155

What distro you use?

david-nordvall commented 4 months ago

@vadika I run Ubuntu 22.04.3 LTS (the default distro installed when simply running wsl --install).

vadika commented 4 months ago

Ok, for me it h worked for a day and a half but then hang again :( I'm using debian distro as the lightest one officially available

vadika commented 4 months ago

Since that time I have had wsl hangs every 20 minutes, as it was before. So switching of graphics support eliminates some warnings in kernel logs, but not making overall situation any better. Being a kernel engineer in a mean time, I decided to add some additional debugging features to WSL2 kernel so this lockups could be somehow investigated. For that I have to rebuild the kernel from the sources, altering it's configuration. I tried to clone kernel sources but wasn't able to finish this process any time I tried because WSL VM hangs every time. Luckily I have another arm machine, so then I decide to buld kernel there and just copy the kernel to my arm windows notebook. Also I found a recipe how to build latest ( mainline) kernel for WSL -- https://github.com/oleksis/wsl2-linux-kernel-config So I built kernel verison 6.8-rc4, copied this kernel to my notebook and after modifying .wslconfig and rebooting WSL VM, I was able to clone and build kernel sources in WSL VM, that I can't do before. There were some RCU panics in kernel logs, but they don't lead to complete WSL lockup, system recovered by itself. I will continue to invesitagate those panics, but if you're brave enough you may try to run your WSL2 VM with 6.8 kernel -- I uploaded it to my Dropbox. Here is the link -- https://www.dropbox.com/t/MpRiV3DJsrtKE1U3 Here is what .wslconfig should look like:

[wsl2]

kernel=d:\\Image

gpuSupport=false

debugConsole=true

david-nordvall commented 4 months ago

@vadika, I did give your kernel a quick try but it resulted in numerous crashes of the type I haven't seen before. In other words, it may well fix the hanging issues but, for other reasons, it didn't work long enough for me to verify :-).

It sounds great that you will continue to investigate the panics!

vadika commented 4 months ago

New version of WSL on --pre-release is out -- v 2.1.3 $ wsl --version WSL version: 2.1.3.0 Kernel version: 5.15.146.1-2 WSLg version: 1.0.60 MSRDC version: 1.2.5105 Direct3D version: 1.611.1-81528511 DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp Windows version: 10.0.22631.3155

david-nordvall commented 4 months ago

@vadika Yes, I've been testing it over the weekend and unfortunately the issue most definitely persists. :-(

maxboone commented 4 months ago

Nevermind, I'm building a recent stable kernel (6.7.7) in a Hyper-V VM at the moment (which seems to be a lot more stable than the WSL VM for some reason).

@vadika what debugging features did you add, just the symbols? Then I'll also build them into my kernel to see whether I can dig up some more information on why these RCU stalls happen (or forward anything useful here).

vadika commented 4 months ago

Nevermind, I'm building a recent stable kernel (6.7.7) in a Hyper-V VM at the moment (which seems to be a lot more stable than the WSL VM for some reason).

@vadika what debugging features did you add, just the symbols? Then I'll also build them into my kernel to see whether I can dig up some more information on why these RCU stalls happen (or forward anything useful here).

Sorry, missed your message. Just debugging, and unfortunately has not enough time to do anything more significant.

maxboone commented 4 months ago

Sorry, missed your message. Just debugging, and unfortunately has not enough time to do anything more significant.

@vadika check the other issue and the RCU mailing list. If you compile with RSEQ disabled you will see that the stalls happen when the kernel tries to copy to user.

Possibly there's some unhandled fault there that is interesting to investigate.

vadika commented 4 months ago

Sorry, missed your message. Just debugging, and unfortunately has not enough time to do anything more significant.

@vadika check the other issue and the RCU mailing list. If you compile with RSEQ disabled you will see that the stalls happen when the kernel tries to copy to user.

Possibly there's some unhandled fault there that is interesting to investigate.

Can you point me to exact messages?

maxboone commented 4 months ago

Can you point me to exact messages?

Mainly https://lore.kernel.org/rcu/ab2e611e-3c21-4380-976a-3acee15907c0@joelfernandes.org/ and down from https://github.com/microsoft/WSL/issues/9454#issuecomment-1974877623

maxboone commented 2 months ago

@vadika did you happen to book any progress on this, I personally haven't had the time to dig into it further.

vadika commented 2 months ago

@vadika did you happen to book any progress on this, I personally haven't had the time to dig into it further.

No, unfortunately -- the problem is somewhere out of my scope of understanding, available code is not enough to debug it. I have made several attempts to localize the problem, and seems it's somewhere between VM kernel-hypervisor interaction, probably some communication channel is dropped from HV side makes kernel to deadloop.

microsoft / WSL