microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

WSL complete freeze #8824

Closed schiorean closed 1 year ago

schiorean commented 2 years ago

Version

Microsoft Windows [Version 10.0.22000.918]

WSL Version

Kernel Version

5.15.57.1

Distro Version

Ubuntu 22.04

Other Software

PhpStorm 2022.2.1

Repro Steps

  1. Open a large project in PhpStorm (starts indexing project files...)
  2. Open another large project in PhpStorm (starts indexing project files...)
  3. Keep switching between large projects and sooner or later WSL is completely frozen while PhpStorm is indexing files

Expected Behavior

PhpStorm finishes files indexing and WSL is usable.

Actual Behavior

WSL freezes completely including any wsl.exe command, so even wsl.exe --shutdown will hang forever. The only way to restart WSL is by doing a computer restart.

Diagnostic Logs

Uploading via Feedback Hub.

schiorean commented 2 years ago

Here are the Feedback Hub WSL logs: https://aka.ms/AAi265d

OneBlue commented 2 years ago

Thanks for reporting this @schiorean. Unfortunately I'm not seeing any logs on Feedback Hub. Can you share logs here ?

OneBlue commented 2 years ago

/logs

ghost commented 2 years ago

Hello! Could you please provide more logs to help us better diagnose your issue?

To collect WSL logs, download and execute collect-wsl-logs.ps1 in an administrative PowerShell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The script will output the path of the log file once done.

Once completed, please upload the output files to this GitHub issue.

Click here for more info on logging

Thank you!

schiorean commented 2 years ago

@OneBlue here's the log. Many thanks!

WslLogs-2022-09-13_14-38-49.zip

OneBlue commented 2 years ago

Thank you @schiorean. I'm not seeing anything that jumps out in the logs.

This looks like the 'freeze' issue we've been investigating for a while.

If you can reproduce this consistently, can you please:

This should give us enough information to understand what's happening

schiorean commented 2 years ago

Hi @OneBlue,

I reproduced it again, but I don't have the gcore output because I can't figure out how to install gdb in the system shell (none of the usual commands are recognized, e.g. trying apt install gdb I get -bash: apt: command not found). So until I can install gdb in the system shell (I sent an email to [edited]), I am sharing here the output of dmesg and the wslservice service dump.

Some extra notes, in case they help:

  1. Even though wsl.exe is frozen, the currently open shells are still usable (by the way, can't I install gdb in the normal, non-system shell and provide the dump from there?)
  2. The only way to restart WSL, besides a full computer restart, is to open a terminal as Administrator and run hcsdiag list followed by hcsdiag kill.

I tried to upload the zip file containing the dumps here, but it looks like the file size is too big. I uploaded it again in Feedback Hub https://aka.ms/AAi265d; however, when I open Details I can't see it... If you can't see it in Feedback Hub, please tell me where to send the file.

OneBlue commented 2 years ago

Oh sorry @schiorean, that was a missed copy-paste from me.

What I meant was: "Install gdb in that shell via tdnf install gdb"

But there was an unrelated email address in my clipboard (edited out). Sorry about that.

Can you try that and share the Linux dumps ? It should be possible to upload them directly on this issue.

schiorean commented 2 years ago

@OneBlue I uploaded dmesg output, wslservice.exe dump is too big (> 25MB) and can't upload it here. dmesg.txt

Later today I hope to provide the dbg dump too.

OneBlue commented 2 years ago

Thank you @schiorean. Sadly nothing jumps out from dmesg so we'll need the dumps to root cause this.

If the dumps are too big for Github, OneDrive / Google drive should work. Given the symptoms I'm suspecting that the issue is on the Linux side, so the Linux dumps should be the most interesting files for this issue.

philmb3487 commented 2 years ago

Hi, I am also getting random freezes like that.

schiorean commented 2 years ago

@OneBlue finally, attached are the core files. And in case you missed them, here are the WSL service logs I uploaded a few days ago: https://github.com/microsoft/WSL/issues/8824#issuecomment-1245776058

core.zip

schiorean commented 2 years ago

@OneBlue here's another core dump; this time it happened faster than the previous one. I estimate about 20 minutes after starting WSL and the system console.

core-take-2.zip

nexton-winjeel commented 2 years ago

We've hit a similar issue here. I can't reliably reproduce it, but I do have a workaround that fixes our specific problem. In our case:

As I said, I can't reliably reproduce this issue, but it seems to happen most often when a file is deleted in WSL (it seems to occur more frequently in the clean stage of our build).

zed76r commented 2 years ago

There is another way to make WSL usable again.

  1. Press Win+S and search for WSL.
  2. Right-click the result and choose "App Settings" (应用设置 in my localized version).
  3. Then click "Reset".

schiorean commented 2 years ago

If we exclude the build folder from Windows Defender, we don't see this behaviour.

@sypaq-nexton Unfortunately, I have had the WSL home folder in the "Virus & threat protection" exclusions for many months already, so it's not a solution (for me at least).

OneBlue commented 2 years ago

Thanks a lot for the dumps @schiorean.

We've published a new version of Store WSL. If you can still reproduce the issue with the latest version, can you please share the Windows and Linux dumps of the same repro?

That would help us a lot

schiorean commented 2 years ago

@OneBlue wsl --update says I already have the latest version installed (0.66.2.0).

OneBlue commented 2 years ago

If you're not enrolled in Windows insider, that makes sense since we haven't published that package everywhere yet.

You can install it without an insider account by downloading the package and running something like (elevated PowerShell) :

# Remove the installed package
$installedPackage = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $installedPackage -AllUsers

# Install the downloaded package
Add-AppxPackage /path/to/msixbundle

schiorean commented 2 years ago

@OneBlue so it happened again with 0.67.6.0. I attached the Linux core dump and dmesg. I uploaded the wslservice.exe dump file via Feedback Hub again; if for some reason you can't access it, let me know and I will upload it to Google Drive. core_take_3.zip dmesg.log

schiorean commented 2 years ago

Actually here's the wslservice.exe dump as well https://drive.google.com/file/d/19nPGc6-NOpv_f0RSp8-5kgRwW1Y-1y5n

OneBlue commented 2 years ago

Thanks @schiorean.

After looking at all the dumps, I have a good idea of where the issue is, but I'll need a bit more info to root cause it.

I built a private version of the package with extra logging: https://1drv.ms/u/s!AiWXuqXSX5K2d45U1MHyOchps0k?e=w8SSjC To install it (elevated powershell):

# Remove the installed package
$installedPackage = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $installedPackage -AllUsers

# Trust the private package's certificate (since it's not an official build, it's not signed with the official Microsoft certificate)
(Get-AuthenticodeSignature "/path/to/msixbundle").SignerCertificate | Export-Certificate -FilePath private-wsl.cert
Import-Certificate -FilePath .\private-wsl.cert -CertStoreLocation Cert:\LocalMachine\Root

# Install the package
Add-AppxPackage /path/to/msixbundle

Once the package is installed:

This should give us more information to identify what's happening.

schiorean commented 2 years ago

@OneBlue attached are the logs as per your latest instructions (it took 2-3 hours until the hang happened). The wslservice.exe dump is uploaded separately here: https://drive.google.com/file/d/19nPGc6-NOpv_f0RSp8-5kgRwW1Y-1y5n/view?usp=sharing

WslLogs-2022-09-26_11-09-22.zip dmesg.txt proc.txt vsoc.txt

OneBlue commented 2 years ago

Thank you @schiorean.

With this information, our current theory is that this issue was introduced by the latest Linux kernel upgrade.

To validate this, can you please:

(Make sure that the '\' are doubled)

schiorean commented 2 years ago

@OneBlue sorry but it happened again with the 5.10 kernel as well. Attached is the dmesg, I didn't have the system console started but was able to get it from normal console. Let me know if I can do anything else to help you guys. This is really frustrating and I really don't want to leave Windows, it's perfect for my development besides this thing.

And when I dumped dmesg I confirmed the kernel version:

sorin@think:~$ uname -a
Linux think 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

dmesg.log

michaelarnauts commented 2 years ago

I have the same problem as OP. Same use case (PhpStorm causes WSL to hang during indexing, or probably anything that does heavy I/O). It happens multiple times a day, and I can reproduce this quite easily.

A faster way to get a fresh WSL env without rebooting (since the wsl --shutdown indeed hangs forever) is to kill wslservice.exe from the task manager.

The output of collect-wsl-logs.ps1 is here: WslLogs-2022-09-28_14-05-40.zip

wsl --version:

WSL version: 0.66.2.0
Kernel version: 5.15.57.1
WSLg version: 1.0.42
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1042

I'll try the "unpublished" package above and retry.

michaelarnauts commented 2 years ago

@OneBlue I've managed to reproduce it again. I've used your build of wsl.

wsl --version
WSL version: 0.67.7.9
Kernel version: 5.15.62.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1042
MSBuild version: 1929
Commit: 6732c637
Build time: 22:50:36 Sep 23 2022

collect-wsl-logs.ps1 output: WslLogs-2022-09-28_14-45-44.zip

the 3 commands from the system distro shell: logs.txt

And the wslservice.exe coredump: https://1drv.ms/u/s!ArytPhqDXQ94lAIJ85S-_XFeCtJ-?e=VL5rb9

michaelarnauts commented 2 years ago

I'm on the same wsl package now, but with 5.10.102.1-microsoft-standard-WSL2 kernel, and I haven't been able to reproduce yet. I'll keep trying :)

It froze again.

# uname -a
Linux Precision-5570 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

collect-wsl-logs.ps1 output: WslLogs-2022-09-28_15-09-45.zip

the 3 commands from the system distro shell: logs.txt

wslservice.exe coredump: https://1drv.ms/u/s!ArytPhqDXQ94lAauo8FK-wQSjVV8?e=7jteYJ

michaelarnauts commented 2 years ago

@OneBlue This still crashes daily here. Anything else I can do to help troubleshoot this?

I'm on 0.68.2.0 now, the latest release on Github.

wsl --version
WSL version: 0.68.2.0
Kernel version: 5.15.62.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1042

szymonos commented 1 year ago

Update to 0.68.4 hasn't fixed the issue. Still hanging for me and have to restart wslservice every now and then to be able to work.

WSL version: 0.68.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.730

OneBlue commented 1 year ago

Thank you @schiorean.

The fact that you managed to reproduce the issue with a different kernel shows that it's not a recent kernel upgrade issue at least.

To give a bit of context, what seems to be happening is that the init process is stuck in a syscall while creating a new session. To understand why, we'll need to look at the kernel stacks.

To do this, can you please:

With this we should have the kernel stacks of all the threads in the init processes
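
The exact steps requested here did not survive in this copy of the thread. As a rough illustration only, and not necessarily what was asked for, kernel stacks on Linux can usually be read as root from /proc/<pid>/task/<tid>/stack. The helper name dump_kernel_stacks and the PROC_ROOT override are hypothetical, introduced only for this sketch:

```shell
# Hypothetical helper: print the kernel stack of every thread of each PID given.
# PROC_ROOT is an assumed override used only to dry-run the function against a
# fake directory tree; on a real system it defaults to /proc.
dump_kernel_stacks() {
  dks_root="${PROC_ROOT:-/proc}"
  for dks_pid in "$@"; do
    for dks_task in "$dks_root/$dks_pid"/task/*; do
      # Skip threads that have exited, or kernels that expose no stack file.
      [ -e "$dks_task/stack" ] || continue
      printf '== pid %s tid %s ==\n' "$dks_pid" "${dks_task##*/}"
      # Reading the stack file normally requires root.
      cat "$dks_task/stack" 2>/dev/null || echo '(stack not readable)'
    done
  done
}
```

For example, `dump_kernel_stacks $(pgrep init) > init-stacks.txt`, run as root from a shell that still responds, would collect the stacks of every init process into one file (the file name is arbitrary).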

alonbl commented 1 year ago

Hi,

Do you have KB5017328 installed on your system? If so, does the problem still reproduce after uninstalling KB5017328? See #6982: KB5017328 caused a regression that seems similar and is related to hibernation or similar activity, and it was confirmed that the problem does not manifest without KB5017328.

Regards,

schiorean commented 1 year ago

@OneBlue attached are the new log files. core.zip dmesg.txt for_pid_output.txt meminfo.txt

michaelarnauts commented 1 year ago

@OneBlue

> wsl --version
WSL version: 0.68.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.674

core.zip (all coredumps) wsl dump.txt (all terminal output)

szymonos commented 1 year ago

I've fixed it for myself by unregistering all my WSL distros, removing WSL from Windows Optional Features, adding it again, and creating new WSL distros from scratch.

I believe, that simply recreating WSL distros from scratch would solve my issue, because I think my WSL distros (Debian and fedoraremix) were broken somehow. I exported them before unregistering, but couldn't import them after the process - installation was stuck, but no errors whatsoever. In the end, I've just copied my home folder from exported tars, so there is no issue there, I didn't plan to use them anyway.

OneBlue commented 1 year ago

Thanks a lot @michaelarnauts and @schiorean. With the debugging information you shared we have identified the root cause.

The issue is caused by a bug in MUSL which causes a deadlock in init.

This bug only occurs if fork() and aio_* methods are called at the same time AND if malloc() needs to increase the process' memory, so it's pretty rare.

The issue has been reported to the MUSL maintainers, but a fix hasn't been released yet. In the meantime we're working on a workaround inside WSL to avoid the potential deadlock.

I'll update this issue once the workaround is released.

benhillis commented 1 year ago

Should be fixed with https://github.com/microsoft/WSL/releases/tag/0.70.4
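
To check whether a machine is already at or past the release linked above, the first line of `wsl --version` can be compared against 0.70.4.0. This is only a sketch under stated assumptions: the function names are hypothetical, the output is assumed to have been saved as plain UTF-8 text (wsl.exe may emit UTF-16 when its output is redirected on Windows), and the comparison relies on GNU `sort -V`.

```shell
# Hypothetical helpers; the names are illustrative and not part of WSL.

# Read `wsl --version` text on stdin and print the last field of the first
# line, which is the version number in "WSL version: 0.70.4.0".
extract_wsl_version() {
  tr -d '\r' | awk 'NR == 1 { print $NF }'
}

# Succeed when version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

With the output saved to a file (wsl-version.txt is an arbitrary name), `version_at_least "$(extract_wsl_version < wsl-version.txt)" 0.70.4.0 && echo "at or past 0.70.4"` reports whether the update has been installed. Taking the last field of the first line, rather than matching a label, keeps the sketch independent of the display language.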

mwoodpatrick commented 1 year ago

I'm running with

WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.870

and am still having this issue. I ran

gcore -a $(pgrep init)

and got:

[New LWP 5]
0x00000000003bc9bd in ?? ()
Failed to open 'core.1' for output.
[Inferior 1 (process 1) detached]
gcore: failed to create core.1

Please advise

Are there docs on gcore somewhere? What does the -a option do?

All the help does currently is

gcore --help
usage:  gcore [-a] [-o prefix] pid1 [pid2...pidN]

It would be helpful if this described what the command did and how the -a option modifies the default action

OneBlue commented 1 year ago

@mwoodpatrick: It looks like you were in a directory where gcore wasn't allowed to write. If you need to write dumps I recommend doing it in a directory under /mnt/c so you can easily access them from Windows.
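
A small guard can catch this before gcore runs. The sketch below is an illustration and was not posted in the thread: first_writable_dir is a hypothetical helper that prints the first of its arguments that is an existing, writable directory.

```shell
# Hypothetical helper: print the first argument that is an existing,
# writable directory; return non-zero if none qualifies.
first_writable_dir() {
  for fwd_dir in "$@"; do
    if [ -d "$fwd_dir" ] && [ -w "$fwd_dir" ]; then
      printf '%s\n' "$fwd_dir"
      return 0
    fi
  done
  return 1
}
```

A possible use, with placeholder paths that would need adapting: `cd "$(first_writable_dir /mnt/c/Users/<you>/wsl-dumps /tmp)" && gcore -a $(pgrep init)`. The core files then land in whichever candidate directory was usable.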

mwoodpatrick commented 1 year ago

Why was this issue closed? I'm still having this issue. Please reopen it.

gcore -a $(pgrep init)
[New LWP 5]
0x00000000003bc9cd in ?? ()
Saved corefile core.1
[Inferior 1 (process 1) detached]
[New LWP 28]
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x00000000003bc9cd in ?? ()
Saved corefile core.2
[Inferior 1 (process 2) detached]
0x00000000003bc9cd in ?? ()
Saved corefile core.6
[Inferior 1 (process 6) detached]
[New LWP 27]
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x00000000003bc9cd in ?? ()
warning: target file /proc/26/cmdline contained unexpected null characters
Saved corefile core.26
[Inferior 1 (process 26) detached]
0x00000000003bc9cd in ?? ()
Saved corefile core.31
[Inferior 1 (process 31) detached]
0x00000000003bc9cd in ?? ()
Saved corefile core.32
[Inferior 1 (process 32) detached]
root@MarkSpectre14 [ /mnt/c/Users/mlwp/Software/WSL/10_29_2022 ]# ls
core.1 core.2 core.26 core.31 core.32 core.6

wsl_hang.zip

NotTheDr01ds commented 1 year ago

@schiorean Since you originally reported the issue, I'm curious if 0.70.4 fixed it for you? I see you thumbs-up'd Ben's comment and close message, so I'm guessing it did fix. If so, that may mean that @mwoodpatrick may be experiencing a similar issue with a different root cause.

schiorean commented 1 year ago

Since I upgraded to 0.70.4 I didn't experience any freeze. All good for me.

evakili commented 1 year ago

I see the problem even after updating to 0.70.4.

Version Info
WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1165

NotTheDr01ds commented 1 year ago

@evakili and @mwoodpatrick Since the person who originally reported this issue has confirmed that it is fixed, it seems that you are likely experiencing a different issue.

hwine commented 1 year ago

There are a number of us who still have the issue (but I, e.g., don't have good data to post). If anyone does open a new issue, please be sure to mention it here, so us "camp followers" can watch that as well. :smile:

androiddisk commented 1 year ago

I have the same problem. No new WSL shell window can be created, and the wsl --shutdown window is stuck; it cannot even be closed. My version 0.70.4.0 was not downloaded from GitHub; it is the 0.70.4.0 version that Windows installed automatically, which should be the same as the GitHub release https://github.com/microsoft/WSL/releases/tag/0.70.4. This problem has troubled me for a long time.

PS C:\Users\zhong> wsl --version
WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.674

I am happy to help if you need anything from me. Thank you.

XhmikosR commented 1 year ago

Unfortunately, this is definitely an issue for me too. I confirmed it on 2 different machines, and it makes WSL totally unusable. wsl --shutdown doesn't work either, so the only solution is to kill wslservice but it's not a real workaround since VS Code is stuck too...

C:\Users\user>wsl --version
WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.819

bowmanjd commented 1 year ago

I, too, had this issue for some time. I got in the habit of closing all WSL windows at the end of the day, and this seemed to help. But I would still occasionally have freezing issues even after installing the fixed WSL version.

So... last week in a terrible and painful mistake, I deleted my WSL virtual disk for the distro I use the most. Since re-creating it, I haven't had this issue at all. Go figure. @XhmikosR, I never had the "totally unusable" experience that you had; it was more of a minor inconvenience. But if it is that bad, you may want to consider redoing your WSL distros.

For what it is worth, I moved from Fedora 36 to Fedora 37. Unsure if that is relevant, though.

XhmikosR commented 1 year ago

The thing is that the issue just started to appear for me a few weeks ago. Before that, everything worked fine.

Now, using Docker + VS Code, it breaks quite frequently during the day. The only workaround for me is stopping and starting wslservice, but it totally breaks my workflow.

I will try reinstalling Ubuntu, but there are other people having this issue, see also #9114.

StewartWon commented 1 year ago

I too have been having issues with this. It always occurs when doing a large C++ CMake build (30 minutes). WSL freezes. This started happening a couple of months ago and basically makes WSL useless. Seems like compiling in Visual Studio on Windows and compiling in WSL makes it worse, but that may be a coincidence. This happens with multiple fresh Ubuntu installations too.

I also get: The Windows Subsystem for Linux service is stopping........ The Windows Subsystem for Linux service could not be stopped.

A reboot is all that works.