Closed schiorean closed 1 year ago
Here are the Feedback Hub WSL logs: https://aka.ms/AAi265d
Thanks for reporting this @schiorean. Unfortunately I'm not seeing any logs on Feedback Hub. Can you share logs here ?
/logs
Hello! Could you please provide more logs to help us better diagnose your issue?
To collect WSL logs, download and execute collect-wsl-logs.ps1 in an administrative powershell prompt:
Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1
The script will output the path of the log file once done.
Once completed please upload the output files to this Github issue.
Thank you!
@OneBlue here's the log. Many thanks!
Thank you @schiorean. I'm not seeing anything that jumps out in the logs.
This looks like the 'freeze' issue we've been investigating for a while.
If you can reproduce this consistently, can you please:
Dump the init processes via: gcore -a $(pgrep init)
This will generate a few core.* files, please share those on this issue
Also share the output of dmesg in the system shell
This should give us enough information to understand what's happening.
Hi @OneBlue,
I reproduced it again, but I don't have the gcore output because I can't figure out how to install gdb in the system shell (none of the usual commands are recognized, e.g. trying apt install gdb I get -bash: apt: command not found). So until I can install gdb in the system shell (I sent an email to [edited]), I am sharing here the output of dmesg and the wslservice service dump.
Some extra notes, in case they help:
When wsl.exe is frozen, the currently open shells are still usable (btw, can't I install gdb in the normal, non-system shell and provide the dump from there?)
hcsdiag list followed by hcsdiag kill.
I tried to upload the zip file containing the dumps here, but it looks like the file size is too big. I uploaded it again in Feedback Hub https://aka.ms/AAi265d, however when I open Details I can't see it... If you can't see it in Feedback Hub, please tell me where to send the file.
Oh sorry @schiorean, that was a missed copy-paste from me.
What I meant was:
"Install gdb in that shell via tdnf install gdb"
But there was an unrelated email address in my clipboard (edited out). Sorry about that.
Can you try that and share the Linux dumps ? It should be possible to upload them directly on this issue.
@OneBlue
I uploaded the dmesg output; the wslservice.exe dump is too big (> 25MB) and I can't upload it here.
dmesg.txt
Later today I hope to provide the dbg dump too.
Thank you @schiorean. Sadly nothing jumps out from dmesg so we'll need the dumps to root cause this.
If the dumps are too big for Github, OneDrive / Google drive should work. Given the symptoms I'm suspecting that the issue is on the Linux side, so the Linux dumps should be the most interesting files for this issue.
Hi, I am also getting random freezes like that.
@OneBlue finally, attached are the core files. And in case you missed it WSL Service logs I uploaded a few days ago https://github.com/microsoft/WSL/issues/8824#issuecomment-1245776058
@OneBlue here's another core dump, this time it happened faster than the previous one. I estimate about 20 minutes since I started WSL and the system console.
We've hit a similar issue here. I can't reliably reproduce it, but I do have a workaround that fixes our specific problem. In our case:
As I said, I can't reliably reproduce this issue, but it seems to happen most often when a file is deleted in WSL (it seems to occur more frequently in the clean stage of our build).
There is another way to make WSL usable again.
If we exclude the build folder from Windows Defender, we don't see this behaviour.
@sypaq-nexton Unfortunately, I have had the WSL home folder in the "Virus & threat protection" exclusions for many months already, so it's not a solution (for me at least).
Thanks a lot for the dumps @schiorean.
We've published a new version of store wsl, if you can still reproduce the issue with the latest version, can you please share the Windows and Linux dumps of the same repro?
That would help us a lot
@OneBlue wsl --update says I'm already at the latest version installed (0.66.2.0).
If you're not enrolled in Windows insider, that makes sense since we haven't published that package everywhere yet.
You can install it without an insider account by downloading the package and running something like (elevated PowerShell) :
$installedPackage = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $installedPackage -AllUsers
Add-AppxPackage /path/to/msixbundle
@OneBlue so it happened again with 0.67.6.0. I attached linux core dump and dmesg. wslservice.exe dump file I uploaded via Feedback hub again, if for some reason you can't access it I will upload it in Google Drive, let me know please. core_take_3.zip dmesg.log
Actually here's the wslservice.exe dump as well https://drive.google.com/file/d/19nPGc6-NOpv_f0RSp8-5kgRwW1Y-1y5n
Thanks @schiorean.
After looking at all the dumps, I have a good idea of where the issue is, but I'll need a bit more info to root cause it.
I built a private version of the package with extra logging: https://1drv.ms/u/s!AiWXuqXSX5K2d45U1MHyOchps0k?e=w8SSjC To install it (elevated powershell):
# Remove the installed package
$installedPackage = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $installedPackage -AllUsers
# Trust the private package's certificate (since it's not an official build, it's not signed with the official Microsoft certificate)
(Get-AuthenticodeSignature "/path/to/msixbundle").SignerCertificate | Export-Certificate -FilePath private-wsl.cert
Import-Certificate -FilePath .\private-wsl.cert -CertStoreLocation Cert:\LocalMachine\Root
# Install the package
Add-AppxPackage /path/to/msixbundle
Once the package is installed:
collect-wsl-logs.ps1
Run ss -lap --vsock, ls -la /proc/*/fd and dmesg and share the output of the three commands
This should give us more information to identify what's happening.
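The three commands requested here can be captured into one file for attaching to the issue. This is only a minimal sketch, assuming it is run as root in the system distro shell; the helper name `capture` and the file name `logs.txt` are illustrative and not part of WSL.

```shell
# capture OUTFILE COMMAND_STRING: append a labelled section with the command's
# stdout and stderr to OUTFILE. A failing command is recorded rather than
# aborting, so the remaining commands still run.
capture() {
    local outfile="$1"
    local cmd="$2"
    {
        printf '===== %s =====\n' "$cmd"
        # Run through sh -c so globs such as /proc/*/fd expand at run time.
        sh -c "$cmd" 2>&1 || printf '(exited with status %d)\n' "$?"
        printf '\n'
    } >> "$outfile"
}

# Usage, in the system distro shell once WSL is frozen:
#   capture logs.txt 'ss -lap --vsock'
#   capture logs.txt 'ls -la /proc/*/fd'
#   capture logs.txt 'dmesg'
```

Keeping each command's output under its own header makes it easier to tell the three outputs apart in a single attachment.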
@OneBlue attached are the logs as per your latest instructions (it took 2-3 hours until the hang happened). wslservice.exe dump uploaded separately here https://drive.google.com/file/d/19nPGc6-NOpv_f0RSp8-5kgRwW1Y-1y5n/view?usp=sharing
Thank you @schiorean.
With this information, our current theory is that this issue was introduced by the latest Linux kernel upgrade.
To validate this, can you please:
%userprofile%/.wslconfig:
[wsl2]
kernel=C:\\path\\to\\kernel-5-10
(Make sure that the '\' are doubled)
Run wsl.exe --shutdown
Run wsl uname -a to make sure that the correct kernel is in use. It should say: Linux [hostname] 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
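The kernel check can also be done from inside the distro by comparing the release string. A small sketch, assuming the expected release is the `5.10.102.1-microsoft-standard-WSL2` string from the uname output quoted here; the function name `kernel_matches` is made up for illustration.

```shell
# kernel_matches EXPECTED [ACTUAL]: succeed if the kernel release equals EXPECTED.
# ACTUAL defaults to `uname -r`; pass it explicitly to check a recorded value.
kernel_matches() {
    local expected="$1"
    local actual="${2:-$(uname -r)}"
    if [ "$actual" = "$expected" ]; then
        echo "OK: running kernel $actual"
    else
        echo "MISMATCH: expected $expected, got $actual" >&2
        return 1
    fi
}

# Usage, inside the distro after wsl.exe --shutdown and a restart:
#   kernel_matches 5.10.102.1-microsoft-standard-WSL2
```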
@OneBlue sorry but it happened again with the 5.10 kernel as well. Attached is the dmesg, I didn't have the system console started but was able to get it from normal console. Let me know if I can do anything else to help you guys. This is really frustrating and I really don't want to leave Windows, it's perfect for my development besides this thing.
And when I dumped dmesg I confirmed the kernel version:
sorin@think:~$ uname -a
Linux think 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I have the same problem as OP. Same usecase (Phpstorm that causes WSL to hang during indexing, or probably, anything that does high IO). It happens multiple times a day, and I can reproduce this quite easily.
A faster way to get a fresh WSL env without rebooting (since wsl --shutdown indeed hangs forever) is to kill wslservice.exe from the task manager.
The output of collect-wsl-logs.ps1 is here:
WslLogs-2022-09-28_14-05-40.zip
wsl --version:
WSL version: 0.66.2.0
Kernel version: 5.15.57.1
WSLg version: 1.0.42
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1042
I'll try the "unpublished" package above and retry.
@OneBlue I've managed to reproduce it again. I've used your build of wsl.
wsl --version
WSL version: 0.67.7.9
Kernel version: 5.15.62.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1042
MSBuild version: 1929
Commit: 6732c637
Build time: 22:50:36 Sep 23 2022
collect-wsl-logs.ps1 output: WslLogs-2022-09-28_14-45-44.zip
the 3 commands from the system distro shell: logs.txt
And the wslservice.exe coredump: https://1drv.ms/u/s!ArytPhqDXQ94lAIJ85S-_XFeCtJ-?e=VL5rb9
I'm on the same wsl package now, but with the 5.10.102.1-microsoft-standard-WSL2 kernel, and I haven't been able to reproduce it yet. I'll keep trying :)
It froze again.
# uname -a
Linux Precision-5570 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
collect-wsl-logs.ps1 output: WslLogs-2022-09-28_15-09-45.zip
the 3 commands from the system distro shell: logs.txt
wslservice.exe coredump: https://1drv.ms/u/s!ArytPhqDXQ94lAauo8FK-wQSjVV8?e=7jteYJ
@OneBlue This still crashes daily here. Anything else I can do to help troubleshoot this?
I'm on 0.68.2.0 now, the latest release on Github.
wsl --version
WSL version: 0.68.2.0
Kernel version: 5.15.62.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22000.1042
Update to 0.68.4 hasn't fixed the issue. Still hanging for me and have to restart wslservice every now and then to be able to work.
WSL version: 0.68.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.730
Thank you @schiorean.
The fact that you managed to reproduce the issue with a different kernel shows that it's not a recent kernel upgrade issue at least.
To give a bit of context, what seems to be happening is that the init process is stuck in a syscall while creating a new session. To understand why, we'll need to look at the kernel stacks.
To do this, can you please:
First clear everything by running wsl --shutdown (make sure nothing is running). If this is unresponsive, kill wslservice.exe.
Then (if not already done) upgrade to wsl 0.68.4 by running: wsl --update
and undo the changes to .wslconfig (meaning revert to the official kernel)
Open a shell inside the system distro (and leave it open) via: wsl -u root --system
Install gdb in that shell via: tdnf install gdb
(You'll need to do it again because the system distro is not saved between reboots)
Run the steps you described to get WSL into that "frozen" state in your regular distro (not inside the system distro)
Once WSL is frozen, use the previously opened shell to dump all the 'init' processes via: gcore -a $(pgrep init)
This will generate a few core.* files, please share those on this issue
Then please share the output of (Still in the system shell) for pid in $(pgrep init) ; do echo -e "\nProcess: $pid" && for tid in $(ls "/proc/$pid/task") ; do echo "tid: $tid" && cat "/proc/$pid/task/$tid/stack" ; done ; done
Also please share the output of dmesg and cat /proc/meminfo in the system shell
With this we should have the kernel stacks of all the threads in the init processes
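For readability, the per-thread stack one-liner in these steps can be written out as a function. This is a sketch of the same loop, not an official WSL tool: the name `dump_task_stacks` and the `PROC_ROOT` parameter are assumptions added for illustration, and reading the `stack` files is assumed to require root, as in the system shell.

```shell
# dump_task_stacks PROC_ROOT PID...: for each PID, print the kernel stack of
# every thread listed under PROC_ROOT/PID/task. PROC_ROOT is normally /proc;
# it is a parameter only so the function can be pointed at a copy.
dump_task_stacks() {
    local proc_root="$1"
    shift
    local pid task_dir tid
    for pid in "$@"; do
        printf '\nProcess: %s\n' "$pid"
        for task_dir in "$proc_root/$pid/task"/*/; do
            [ -d "$task_dir" ] || continue
            tid=$(basename "$task_dir")
            printf 'tid: %s\n' "$tid"
            # Report unreadable entries instead of aborting the whole dump.
            cat "$task_dir/stack" 2>/dev/null || printf '(stack not readable)\n'
        done
    done
}

# Usage, in the system shell once WSL is frozen:
#   dump_task_stacks /proc $(pgrep init) > for_pid_output.txt
```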
Hi,
Do you have KB5017328 installed on your system? If yes, does the problem still reproduce after reverting KB5017328? See #6982: KB5017328 caused a regression that seems similar and is related to hibernate or similar activity, and it was confirmed that without KB5017328 it does not manifest.
Regards,
@OneBlue attached are the new log files. core.zip dmesg.txt for_pid_output.txt meminfo.txt
@OneBlue
> wsl --version
WSL version: 0.68.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.44
MSRDC version: 1.2.3401
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.674
core.zip (all coredumps) wsl dump.txt (all terminal output)
I've fixed it for myself by unregistering all my WSL distros, removing WSL from Windows Optional Features, adding it again, and creating new WSL distros from scratch.
I believe, that simply recreating WSL distros from scratch would solve my issue, because I think my WSL distros (Debian and fedoraremix) were broken somehow. I exported them before unregistering, but couldn't import them after the process - installation was stuck, but no errors whatsoever. In the end, I've just copied my home folder from exported tars, so there is no issue there, I didn't plan to use them anyway.
Thanks a lot @michaelarnauts and @schiorean. With the debugging information you shared we have identified the root cause.
The issue is caused by a bug in MUSL which causes a deadlock in init.
This bug only occurs if fork() and aio_* methods are called at the same time AND if malloc() needs to increase the process' memory, so it's pretty rare.
The issue has been reported to the MUSL maintainers, but a fix hasn't been released yet. In the meantime we're working on a workaround inside WSL to avoid the potential deadlock.
I'll update this issue once the workaround is released.
Should be fixed with https://github.com/microsoft/WSL/releases/tag/0.70.4
I'm running with
WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.870
and am still having this issue. I ran
gcore -a $(pgrep init)
and got:
[New LWP 5]
0x00000000003bc9bd in ?? ()
Failed to open 'core.1' for output.
[Inferior 1 (process 1) detached]
gcore: failed to create core.1
Please advise
Are there docs on gcore somewhere? What does the -a option do?
All the help does currently is
gcore --help
usage: gcore [-a] [-o prefix] pid1 [pid2...pidN]
It would be helpful if this described what the command did and how the -a option modifies the default action
@mwoodpatrick: It looks like you were in a directory where gcore wasn't allowed to write. If you need to write dumps I recommend doing it in a directory under /mnt/c so you can easily access them from Windows.
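One way to avoid the "Failed to open 'core.1' for output" error is to check for a writable directory before running gcore. A minimal sketch, assuming gcore writes its core.* files into the current directory as the output above suggests; the helper name `first_writable_dir` and the path `/mnt/c/wsl-dumps` are hypothetical.

```shell
# first_writable_dir DIR...: print the first argument that is an existing,
# writable directory; fail if none qualifies.
first_writable_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no writable directory among: $*" >&2
    return 1
}

# Usage, in the system shell (directory names are examples only):
#   dump_dir=$(first_writable_dir /mnt/c/wsl-dumps /tmp) \
#     && cd "$dump_dir" && gcore -a $(pgrep init)
```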
Why was this issue closed? I'm still having this issue. Please reopen it.
gcore -a $(pgrep init)
[New LWP 5]
0x00000000003bc9cd in ?? ()
Saved corefile core.1
[Inferior 1 (process 1) detached]
[New LWP 28]
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x00000000003bc9cd in ?? ()
Saved corefile core.2
[Inferior 1 (process 2) detached]
0x00000000003bc9cd in ?? ()
Saved corefile core.6
[Inferior 1 (process 6) detached]
[New LWP 27]
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
0x00000000003bc9cd in ?? ()
warning: target file /proc/26/cmdline contained unexpected null characters
Saved corefile core.26
[Inferior 1 (process 26) detached]
0x00000000003bc9cd in ?? ()
Saved corefile core.31
[Inferior 1 (process 31) detached]
0x00000000003bc9cd in ?? ()
Saved corefile core.32
[Inferior 1 (process 32) detached]
root@MarkSpectre14 [ /mnt/c/Users/mlwp/Software/WSL/10_29_2022 ]# ls
core.1 core.2 core.26 core.31 core.32 core.6
@schiorean Since you originally reported the issue, I'm curious if 0.70.4 fixed it for you? I see you thumbs-up'd Ben's comment and close message, so I'm guessing it did fix. If so, that may mean that @mwoodpatrick may be experiencing a similar issue with a different root cause.
Since I upgraded to 0.70.4 I didn't experience any freeze. All good for me.
I see the problem even after updating to 0.70.4.
@evakili and @mwoodpatrick Since the person who originally reported this issue has confirmed that it is fixed, it seems that you are likely experiencing a different issue.
There are a number of us who still have the issue (but I, e.g., don't have good data to post). If anyone does open a new issue, please be sure to mention it here, so us "camp followers" can watch that as well. :smile:
I have the same problem. No new WSL shell window can be created, and the wsl --shutdown window is stuck; the window cannot even be closed. My 0.70.4.0 was not downloaded from GitHub; it is the version automatically updated by Windows, which should be the same as https://github.com/microsoft/WSL/releases/tag/0.70.4. This problem has troubled me for a long time.
PS C:\Users\zhong> wsl --version
WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.674
I will cooperate with you if you need me. Thank you.
Unfortunately, this is definitely an issue for me too. I confirmed it on 2 different machines, and it makes WSL totally unusable. wsl --shutdown doesn't work either, so the only solution is to kill wslservice, but it's not a real workaround since VS Code is stuck too...
C:\Users\user>wsl --version
WSL version: 0.70.4.0
Kernel version: 5.15.68.1
WSLg version: 1.0.45
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.819
I, too, had this issue for some time. I got in the habit of closing all WSL windows at the end of the day, and this seemed to help. But I would still occasionally have freezing issues even after installing the fixed WSL version.
So... last week in a terrible and painful mistake, I deleted my WSL virtual disk for the distro I use the most. Since re-creating it, I haven't had this issue at all. Go figure. @XhmikosR, I never had the "totally unusable" experience that you had; it was more of a minor inconvenience. But if it is that bad, you may want to consider redoing your WSL distros.
For what it is worth, I moved from Fedora 36 to Fedora 37. Unsure if that is relevant, though.
The thing is that the issue just started to appear for me a few weeks ago. Before that, everything worked fine.
Now, using Docker + VS Code, it breaks quite frequently during the day. The only workaround for me is stopping and starting wslservice, but it breaks my workflow, totally.
I will try reinstalling Ubuntu, but there are other people having this issue, see also #9114.
I too have been having issues with this. It always occurs when doing a large c++ cmake build (30mins). WSL freezes. This started happening a couple of months ago and basically makes WSL useless. Seems like compiling in Visual Studio in Windows and compiling in WSL makes it worse, but that may be a coincidence. This happens with multiple fresh Ubuntu installations too.
Also get: The Windows Subsystem for Linux service is stopping........ The Windows Subsystem for Linux service could not be stopped.
A reboot is all that works
Version
Microsoft Windows [Version 10.0.22000.918]
WSL Version
Kernel Version
5.15.57.1
Distro Version
Ubuntu 22.04
Other Software
PhpStorm 2022.2.1
Repro Steps
Expected Behavior
PhpStorm finishes files indexing and WSL is usable.
Actual Behavior
WSL freezes completely, including any wsl.exe command, so even wsl.exe --shutdown will hang forever. The only way to restart WSL is by doing a computer restart.
Diagnostic Logs
Uploading via Feedback Hub.