david-nordvall opened 8 months ago
There is a Windows OS fix that targets an ARM-specific hang that should be going out next month (I think the fourth patch Tuesday of November).
@benhillis That sounds hopeful! Does this fix specifically target hangs in WSL? I've had zero issues with hangs and slowdowns outside of WSL.
@david-nordvall - yes it is a hypervisor fix specific to running virtual machines.
@david-nordvall fyi, I also got the failure messages -- turns out the dump utility in the MS script doesn't work for ARM, but I could get valid dumps with procdump64a.exe, as described over here.
@benhillis Do you know if the fix has been released yet? I have installed all updates available for my Surface Pro 9 5G to date, but the problem persists. I am currently running the following:
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.2792
@benhillis One more Patch Tuesday patch applied and the issue still persists. Any word on whether the hypervisor fix has actually been released? If it has, how can I help you get more data for debugging? This issue really needs to get fixed... :-(
Current WSL version info:
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.2861
Update: I completely wiped my Surface Pro 9 and re-installed Windows (using the restore functionality built into Windows). The issue still persists :-(
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.2861
@benhillis can you confirm or deny if the hypervisor fix you mentioned earlier has been released?
Another Patch Tuesday patch applied and the issue still persists.
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3007
@benhillis and @craigloewen-msft, I'm happy to provide you with any logs you may need to properly debug and fix this issue.
So, I realize Windows on ARM isn't Microsoft's biggest priority, especially for issues specific to developers using Windows on ARM, since there are probably very few such developers. But I'd still love to hear whether the Hyper-V fix you mentioned, @benhillis, has been released yet.
Considering that there will "soon" be devices with Qualcomm's new Snapdragon X processor, which might appeal to more developers, and that Docker might release a Docker Desktop version for Windows on ARM during 2024, I'd hope Microsoft would pay a little more attention to this issue, since it pretty much makes Windows on ARM a non-starter as a developer platform, regardless of the processor's underlying performance and efficiency.
Another Patch Tuesday patch applied. The issue still persists:
WSL version: 2.0.9.0
Kernel version: 5.15.133.1-1
WSLg version: 1.0.59
MSRDC version: 1.2.4677
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3085
Having exactly the same issue. WSL hangs quickly after using VS code remote.
ThinkPad X13s Gen 1 -- looks like I found the solution, at least it works for me -- no hangs, no excessive CPU load, and it survives sleep/resume.

WSL version: 2.1.1.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3085
I upgraded WSL to the pre-release (`wsl --update --pre-release`) and disabled DirectX (in `C:\Users\%USERNAME%\.wslconfig`: `[wsl2]` / `gpuSupport=false`). I did it because there were kernel messages about dx interaction in the Linux kernel log. Restarted WSL and have had no problems since (24 hours stable now).
@vadika, that's great! Please keep me posted if the fix works more long term than 24 hours :-). Unfortunately it made no (or little) difference for me on my Surface Pro 9 5G.
WSL version: 2.1.1.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3085
I can still make WSL hang and/or use excessive CPU by starting a VS Code dev container with a couple of moderately complex .NET projects. I may be able to work a little longer before WSL starts to freeze; it is difficult to tell. Either way, using a pre-release version of WSL and disabling GPU support doesn't fix the issue for me.
Some logs from `dmesg -w`. Don't know if this is useful info, @benhillis and @craigloewen-msft?
[ 429.728472] rcu: INFO: rcu_sched self-detected stall on CPU
[ 429.732010] rcu: 3-....: (14771 ticks this GP) idle=0d1/1/0x4000000000000002 softirq=21325/21325 fqs=6694
[ 429.733644] (t=15002 jiffies g=58457 q=34118)
[ 429.733665] Task dump for CPU 3:
[ 429.733666] task:.NET Debugger state:R running task stack: 0 pid: 2812 ppid: 1198 flags:0x0000080f
[ 429.733670] Call trace:
[ 429.733671] dump_backtrace+0x0/0x1c0
[ 429.733762] show_stack+0x1c/0x24
[ 429.733764] sched_show_task+0x164/0x190
[ 429.733783] dump_cpu_task+0x48/0x54
[ 429.733833] rcu_dump_cpu_stacks+0xec/0x130
[ 429.733834] rcu_sched_clock_irq+0x914/0xa50
[ 429.733845] update_process_times+0xa0/0x190
[ 429.733855] tick_sched_timer+0x5c/0xd0
[ 429.733882] __hrtimer_run_queues+0x140/0x330
[ 429.733883] hrtimer_interrupt+0xf4/0x240
[ 429.733885] hv_stimer0_isr+0x28/0x30
[ 429.733927] hv_stimer0_percpu_isr+0x14/0x20
[ 429.733928] handle_percpu_devid_irq+0x8c/0x1b4
[ 429.733937] handle_domain_irq+0x64/0x90
[ 429.733947] gic_handle_irq+0x58/0x128
[ 429.733948] call_on_irq_stack+0x20/0x38
[ 429.733950] do_interrupt_handler+0x54/0x5c
[ 429.733951] el1_interrupt+0x2c/0x4c
[ 429.733962] el1h_64_irq_handler+0x14/0x20
[ 429.733963] el1h_64_irq+0x74/0x78
[ 429.733964] mm_release+0xd8/0x140
[ 429.733983] exit_mm_release+0x2c/0x40
[ 429.733984] do_exit+0x1a0/0xa70
[ 429.733993] do_group_exit+0x3c/0xa4
[ 429.733995] get_signal+0x1e0/0x9b0
[ 429.734005] do_notify_resume+0x15c/0xebc
[ 429.734007] el0_svc+0x3c/0x4c
[ 429.734008] el0t_64_sync_handler+0x9c/0x120
[ 429.734009] el0t_64_sync+0x158/0x15c
[ 451.744411] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 3-... 4-... } 15308 jiffies s: 501 root: 0x18/.
[ 451.745440] rcu: blocking rcu_node structures (internal RCU debug):
[ 451.745813] Task dump for CPU 3:
[ 451.745815] task:.NET Debugger state:R running task stack: 0 pid: 2812 ppid: 1198 flags:0x0000080f
[ 451.745819] Call trace:
[ 451.745820] __switch_to+0xb4/0xec
[ 451.745827] el1h_64_irq+0x74/0x78
[ 451.745829] mm_release+0xd8/0x140
[ 451.745857] exit_mm_release+0x2c/0x40
[ 451.745858] do_exit+0x1a0/0xa70
[ 451.745860] do_group_exit+0x3c/0xa4
[ 451.745861] get_signal+0x1e0/0x9b0
[ 451.745863] do_notify_resume+0x15c/0xebc
[ 451.745865] el0_svc+0x3c/0x4c
[ 451.745867] el0t_64_sync_handler+0x9c/0x120
[ 451.745869] el0t_64_sync+0x158/0x15c
[ 451.745870] Task dump for CPU 4:
[ 451.745870] task:snapd state:R running task stack: 0 pid: 651 ppid: 160 flags:0x0000000a
[ 451.745873] Call trace:
[ 451.745873] __switch_to+0xb4/0xec
[ 451.745875] 0xffff0000193cc880
> wsl --version
WSL version: 2.1.1.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3085
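As an aside, when stall reports scroll past in a long `dmesg -w` capture, a small filter can pull out just the stall headers and the per-CPU task dumps. This is only a sketch; the `rcu_stalls` helper name is mine, and the marker strings are taken from the logs above:

```shell
# Print only the RCU stall headers and "Task dump for CPU" lines from a
# kernel log read on stdin. Marker strings match the dmesg output above.
rcu_stalls() {
  grep -E 'rcu_sched (self-detected|detected expedited)|Task dump for CPU'
}
# Usage: dmesg -w | rcu_stalls
```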
And just to show that it's not always the `.NET Debugger` task that's affected by CPU stalls (which happened to be the case in both stalls in the logs above), here are some more logs in which `snapd` is the affected task:
[ 668.559419] rcu: INFO: rcu_sched self-detected stall on CPU
[ 668.561475] rcu: 0-....: (14113 ticks this GP) idle=9f7/1/0x4000000000000002 softirq=21990/21990 fqs=6453
[ 668.563573] (t=15000 jiffies g=79477 q=24677)
[ 668.563580] Task dump for CPU 0:
[ 668.563583] task:snapd state:R running task stack: 0 pid: 634 ppid: 157 flags:0x0000000a
[ 668.563594] Call trace:
[ 668.563596] dump_backtrace+0x0/0x1c0
[ 668.563609] show_stack+0x1c/0x24
[ 668.563616] sched_show_task+0x164/0x190
[ 668.563624] dump_cpu_task+0x48/0x54
[ 668.563630] rcu_dump_cpu_stacks+0xec/0x130
[ 668.563635] rcu_sched_clock_irq+0x914/0xa50
[ 668.563641] update_process_times+0xa0/0x190
[ 668.563648] tick_sched_timer+0x5c/0xd0
[ 668.563655] __hrtimer_run_queues+0x140/0x330
[ 668.563661] hrtimer_interrupt+0xf4/0x240
[ 668.563667] hv_stimer0_isr+0x28/0x30
[ 668.563673] hv_stimer0_percpu_isr+0x14/0x20
[ 668.563677] handle_percpu_devid_irq+0x8c/0x1b4
[ 668.563682] handle_domain_irq+0x64/0x90
[ 668.563689] gic_handle_irq+0x58/0x128
[ 668.563694] call_on_irq_stack+0x20/0x38
[ 668.563699] do_interrupt_handler+0x54/0x5c
[ 668.563720] el1_interrupt+0x2c/0x4c
[ 668.563726] el1h_64_irq_handler+0x14/0x20
[ 668.563731] el1h_64_irq+0x74/0x78
[ 668.563735] clear_rseq_cs.isra.0+0x4c/0x60
[ 668.563755] do_notify_resume+0x100/0xebc
[ 668.563762] el0_svc+0x3c/0x4c
[ 668.563766] el0t_64_sync_handler+0x9c/0x120
[ 668.563771] el0t_64_sync+0x158/0x15c
And this is after the latest Patch Tuesday patches have been applied:
> wsl --version
WSL version: 2.1.1.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3155
What distro do you use?
@vadika I run Ubuntu 22.04.3 LTS (the default distro installed when simply running `wsl --install`).
Ok, for me it worked for a day and a half but then hung again :( I'm using the Debian distro as the lightest one officially available.
Since that time I have had WSL hangs every 20 minutes, as it was before. So switching off graphics support eliminates some warnings in the kernel logs, but doesn't make the overall situation any better.

Being a kernel engineer, in the meantime I decided to add some additional debugging features to the WSL2 kernel so these lockups could be investigated. For that I had to rebuild the kernel from source, altering its configuration. I tried to clone the kernel sources but was never able to finish, because the WSL VM hangs every time. Luckily I have another ARM machine, so I decided to build the kernel there and just copy it to my ARM Windows notebook. I also found a recipe for building the latest (mainline) kernel for WSL -- https://github.com/oleksis/wsl2-linux-kernel-config

So I built kernel version 6.8-rc4, copied it to my notebook, and after modifying .wslconfig and rebooting the WSL VM, I was able to clone and build the kernel sources inside the WSL VM, which I couldn't do before. There were some RCU panics in the kernel logs, but they don't lead to a complete WSL lockup; the system recovered by itself. I will continue to investigate those panics, but if you're brave enough you may try running your WSL2 VM with the 6.8 kernel -- I uploaded it to my Dropbox. Here is the link -- https://www.dropbox.com/t/MpRiV3DJsrtKE1U3

Here is what .wslconfig should look like:
[wsl2]
kernel=d:\\Image
gpuSupport=false
debugConsole=true
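Before restarting WSL it can be worth double-checking that the config file actually carries the `kernel=` entry you expect. A tiny helper sketch (the `wsl_kernel_path` name is mine, and the parsing is deliberately naive -- first `kernel=` line wins, sections are ignored):

```shell
# Print the kernel image path configured in a .wslconfig-style file.
# Naive sketch: takes the first kernel= line; [wsl2] sectioning is ignored.
wsl_kernel_path() {
  sed -n 's/^kernel=//p' "$1" | head -n 1
}
# Usage from inside WSL: wsl_kernel_path /mnt/c/Users/$USER/.wslconfig
```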
@vadika, I gave your kernel a quick try, but it resulted in numerous crashes of a type I haven't seen before. In other words, it may well fix the hanging issues but, for other reasons, it didn't run long enough for me to verify :-).
It sounds great that you will continue to investigate the panics!
New version of WSL on --pre-release is out -- v2.1.3

$ wsl --version
WSL version: 2.1.3.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22631.3155
@vadika Yes, I've been testing it over the weekend and unfortunately the issue most definitely persists. :-(
Nevermind, I'm building a recent stable kernel (6.7.7) in a Hyper-V VM at the moment (which seems to be a lot more stable than the WSL VM for some reason).
@vadika what debugging features did you add, just the symbols? Then I'll also build them into my kernel to see whether I can dig up some more information on why these RCU stalls happen (or forward anything useful here).
Sorry, missed your message. Just debugging options, and unfortunately I haven't had enough time to do anything more significant.
@vadika check the other issue and the RCU mailing list. If you compile with RSEQ disabled you will see that the stalls happen when the kernel tries to copy to user.
Possibly there's some unhandled fault there that is interesting to investigate.
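For reference, rebuilding with restartable sequences off means flipping these options in the kernel .config before rebuilding (a sketch; option names are from the mainline Kconfig, so verify against your tree):

```
# In the WSL2 kernel .config, before `make olddefconfig` and rebuilding:
# CONFIG_RSEQ is not set
# CONFIG_DEBUG_RSEQ is not set
```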
Can you point me to exact messages?
Mainly https://lore.kernel.org/rcu/ab2e611e-3c21-4380-976a-3acee15907c0@joelfernandes.org/ and down from https://github.com/microsoft/WSL/issues/9454#issuecomment-1974877623
@vadika did you happen to make any progress on this? I personally haven't had the time to dig into it further.
No, unfortunately -- the problem is somewhere outside my scope of understanding, and the available code is not enough to debug it. I have made several attempts to localize the problem, and it seems to be somewhere in the VM kernel/hypervisor interaction; probably some communication channel dropped on the HV side makes the kernel loop forever.
Windows Version
10.0.22621.2428
WSL Version
1.2.5.0
Are you using WSL 1 or WSL 2?
Kernel Version
5.15.90.1
Distro Version
Ubuntu 22.04
Other Software
Repro Steps
Start a VS Code remote development environment with a development Docker container running on Docker Engine on Ubuntu on WSL2. Start working/compiling/debugging and observe that the WSL2 environment, including all Docker containers running on WSL2, will start to slow down/hang with high CPU usage after a short while (specifically the VmmemWSL process). The environment will completely hang for a number of seconds, then continue working for a while. The problems not only affect the Docker environment but also "native WSL2" shells/environments.
The problems get increasingly worse (more frequent hangs that last for longer and longer) until WSL2 is completely unusable. Most of the time, running wsl --shutdown will shut WSL2 down, after which it can be restarted and the slow/hang cycle can begin again. Note that running wsl --shutdown takes abnormally long to execute once this issue has been encountered (30 s to 10 min). More complex development environments (for example, multiple code projects compiling in parallel, plus development databases and other software running in containers) will speed up the slow/hang cycle.
This issue is created per request in issue #9454.
Expected Behavior
The VS Code remote development environment will not slow down/start to hang over time and will continue using "normal" CPU cycles.
Actual Behavior
The VS Code remote development environment, and WSL2 as a whole, starts to slow down/hang with increasing frequency.
Diagnostic Logs
WSL dumps collected during a hang is available at https://doits-my.sharepoint.com/:u:/g/personal/david_nordvall_doitsolutions_se/EedqfOgnSnJOtLAKe7Jd2R0BqtdtieIthWXUbvLOPBf9zg?e=2N8fuq.
Please note that I got several instances of this error when the dumps were collected: