microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

Port forwarding repeated failure on WSL 1.1.0 #9508

Closed rudyzeinoun closed 1 year ago

rudyzeinoun commented 1 year ago

Version

Microsoft Windows [Version 10.0.22623.1095]

WSL Version

Kernel Version

5.15.83.1

Distro Version

Ubuntu 20.04

Other Software

Apache/2.4.41
mysqld Ver 10.3.37-MariaDB-0ubuntu0.20.04.1
PHP 8.1 + php8.1fpm
systemd enabled in /etc/wsl.conf

Repro Steps

With WSL 1.1.0 (recently pushed to the Store although marked as pre-release), port forwarding fails repeatedly. Start Apache on the Ubuntu distro, then from Command Prompt on Windows run:

telnet localhost 80

It will work. A few seconds later, repeat the telnet command and it will fail. Port forwarding to services running in the WSL Ubuntu distro no longer works.

Expected Behavior

telnet command should keep working on the port.

Actual Behavior

The telnet command will time out.

On the first try:

netstat -an | findstr /c:"80" | findstr /c:"LISTENING"

shows port 80 as listening.

After a few seconds, repeat the netstat command: the port is no longer listed. This applies to any service running on WSL, not just Apache. Port forwarding fails a few seconds after the service comes up.

Restart Apache with "service apache2 restart" and the port will appear in netstat again. Wait 10 seconds and check again: it disappears.
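The symptom described above (the port answers at first, then stops a few seconds later) can be watched from the Windows side with a small polling script. This is only a minimal sketch, not part of the original report; the helper names `port_open` and `watch_port` are made up for illustration, and the attempt count and interval are arbitrary.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch_port(host: str, port: int, attempts: int = 10, interval: float = 2.0):
    """Probe the port repeatedly; return a list of (seconds_elapsed, reachable)."""
    start = time.monotonic()
    results = []
    for i in range(attempts):
        results.append((round(time.monotonic() - start, 1), port_open(host, port)))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Calling `watch_port("localhost", 80)` right after restarting Apache in the distro should show whether the forwarded port flips from reachable to unreachable. Note that `socket.create_connection` tries every address that `localhost` resolves to, so it reports success if either the IPv4 or the IPv6 loopback answers.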

Diagnostic Logs

No response

rudyzeinoun commented 1 year ago

Downgrading to 1.0.3 resolves the issue.

$Package = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $Package -AllUsers
Add-AppxPackage .\Microsoft.WSL_1.0.3.0_x64_ARM64.msixbundle

DarkPhoenix2704 commented 1 year ago

I am having the same issue

Port Forwarding works for a few seconds, then doesn't work.

Distro Version: Ubuntu 20.04 Windows Build: 22621.192 WSL Version: 1.1.0.0

hensou commented 1 year ago

Same issue for me with a nextjs 13 project. Port Forwarding doesn't work.

Distro Version: Archlinux Windows Build: Windows 11 Dev Preview Build 25281 WSL Version: 1.1.0.0

Downgrading it to 1.0.3, as suggested by @rudyzeinoun , did make it work again.

OneBlue commented 1 year ago

/logs

ghost commented 1 year ago

Hello! Could you please provide more logs to help us better diagnose your issue?

To collect WSL logs, download and execute collect-wsl-logs.ps1 in an administrative powershell prompt:

Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1

The script will output the path of the log file once done.

Once completed, please upload the output files to this GitHub issue.

Thank you!

ghost commented 1 year ago

Thanks for reporting this. I need logs from someone because I can't reproduce. I installed apache2, default config.

rudyzeinoun commented 1 year ago

Here's my log. I removed 1.0.3 and reinstalled 1.1.0 from the Store. The problem is reproducible again. During the log capture, the port forwarding was already gone. I restarted apache2. The port came back up. Did 1 connection to it, and then it died, as expected.

WslLogs-2023-01-20_09-58-14.zip

maicol07 commented 1 year ago

I confirm the same issue. 1 connection is successful and then I can't access it anymore. Here is my log if it can help: WslLogs-2023-01-20_14-20-05.zip

john-henry commented 1 year ago

I believe I am getting a similar problem. My experience is with starting up DDEV with Docker.

Error response from daemon: Ports are not available: exposing port TCP 127.0.0.1:443 -> 0.0.0.0:0: listen tcp 127.0.0.1:443: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

elsaco commented 1 year ago

With WSL 1.1.0 it's connecting over IPv6 only. This is on Windows 10. The web server is running inside a WSL instance:

/usr/local/bin> ss -lt
State         Recv-Q        Send-Q               Local Address:Port               Peer Address:Port       Process
LISTEN        0             128                      127.0.0.1:ipp                     0.0.0.0:*
LISTEN        0             4096                             *:http                          *:*
LISTEN        0             128                          [::1]:ipp                        [::]:*

On Windows side 127.0.0.1 fails to connect:

PS C:\Windows\system32> .\curl.exe -4 -v localhost
*   Trying 127.0.0.1:80...
* connect to 127.0.0.1 port 80 failed: Connection refused
* Failed to connect to localhost port 80 after 2051 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 80 after 2051 ms: Connection refused

but ::1 works, and the connection is steady:

PS C:\Windows\system32> .\curl.exe -6 -v localhost
*   Trying ::1:80...
* Connected to localhost (::1) port 80 (#0)
> GET / HTTP/1.1
> Host: localhost
> User-Agent: curl/7.83.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 403 Forbidden
< Date: Fri, 20 Jan 2023 17:59:08 GMT
< Server: Apache

More info:

PS C:\Windows\system32> Get-NetTCPConnection -localport 80 | ft -AutoSize

LocalAddress LocalPort RemoteAddress RemotePort State  AppliedSetting OwningProcess
------------ --------- ------------- ---------- -----  -------------- -------------
::1          80        ::            0          Listen                2972

PS C:\Windows\system32> Get-Process -id 2972

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    150      12     1604       7588       0.05   2972   1 wslhost

ghost commented 1 year ago

Ahhh, Thank-you! I was wondering. Is it bound to ::1 on the guest/wsl side?

ghost commented 1 year ago

From the logs I've looked at, it appears the binds on the guest are on port 80 over IPv6.

This means that the relay is going to bind the host (Windows side) at ::1 port 80.
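To see which loopback family the relay is actually answering on, the two addresses can be tried separately, much like the `curl -4` / `curl -6` checks earlier in the thread. The following is a rough sketch under that assumption; `loopback_status` is a hypothetical helper name and is not part of WSL or any tool mentioned here.

```python
import socket


def loopback_status(port: int, timeout: float = 2.0) -> dict:
    """Try the IPv4 and IPv6 loopback addresses separately for one TCP port."""
    targets = {
        "ipv4": (socket.AF_INET, ("127.0.0.1", port)),
        "ipv6": (socket.AF_INET6, ("::1", port, 0, 0)),
    }
    status = {}
    for name, (family, address) in targets.items():
        try:
            with socket.socket(family, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect(address)
            status[name] = True
        except OSError:
            # Refused, timed out, or the address family is unavailable.
            status[name] = False
    return status
```

Run from Windows, `loopback_status(80)` returning `{"ipv4": False, "ipv6": True}` would match the situation described in this thread, where only `::1` is bound on the host side.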

ghost commented 1 year ago

@rudyzeinoun, I checked your logs and saw the relay bound ::1 80 on the host side, forwarded two connections, and then closed the listening socket 6 seconds later. Did you already verify that the port is still bound on the guest side inside WSL? Using something similar to sudo netstat -lnpt

ghost commented 1 year ago

Non-deterministic repro instructions are here: https://github.com/microsoft/WSL/issues/9516#issuecomment-1400557869 . If anyone is experiencing forwarding failures after a period other than exactly 1 minute, please say so, because in that case there may be a second issue. Thanks again for your patience.

ghost commented 1 year ago

Was able to reproduce the bug where the port tracker stops tracking immediately.

rudyzeinoun commented 1 year ago

Yes, the port is still bound on the guest side. If I restart the service (unbind, bind again), it works for another 1 or 2 connections and disappears from host side again. I see you're able to reproduce it now. 👍

cheynewallace commented 1 year ago

I'm having the same issue after updating (automatically via the app store) to 1.1.0. I didn't realize it had updated until I went digging around these issues and found this ticket.

Wasted half a day on this yesterday disabling firewalls and resetting networks and ended up giving up and switching to my Macbook so I could actually work before I discovered this.

The issue was as described here: I could usually boot my machine, open WSL, run Docker and my dev API, and it would work using localhost from the Windows host, in any browser, for about 30 seconds; then it would just stop working. Connecting directly to the dynamic IP worked; localhost did not.

Performing wsl --shutdown did nothing and did not restore the localhost routing.

Downgrading to 1.0.3 as suggested by @rudyzeinoun did solve the issue.

CodeW0lf commented 1 year ago

WslLogs-2023-01-24_00-11-56.zip

My experience with this is that having one port forwarded is fine, but as soon as another port attempts to be forwarded, the first port will get disconnected and no further forwarding can occur until wsl --shutdown.

My dev environment starts two servers, one on port 4200 and one on 4201. If I only start one of them and no other services that require a forwarded port, the connection and forwarded port is stable. Seconds after attempting to spawn another service that would require a forwarded port, all port forwarding stops working.

Services are still accessible from within the WSL distribution no matter what is going on with the forwarding.

dylanirion commented 1 year ago

I believe I am experiencing this issue via aws-sam-cli: when running a lambda function locally, Docker hits an exception mapping ports:

Ports are not available: exposing port TCP 127.0.0.1:5400 -> 0.0.0.0:0: listen tcp 127.0.0.1:5400: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

xcellsoft commented 1 year ago

Similar issues with running docker swarm stack, unable to bind ports. One of our dev machines got an update over the weekend, took us better part of the day to realize it was WSL 1.1.0 update. We downgraded and all is fixed.

Docker stack error snippet:

skutok2kxqwb4sqhjy3cgnwza    \_ mysql_db.1               mysql:5.7.39@sha256:a85b8313feb7298ae240c4beb33a1b4d2e3a3867d3195bab9ed9346d332217c7                                           yoda      Shutdown        Failed less than a second ago     "starting container failed: container b5a3534fc47c51db1edd4a91b4b948b490768b98a25ab533f52121f0809e37b0: endpoint join on GW Network failed: driver failed programming external connectivity on endpoint gateway_e0c5563d9768 (13ddd6d5b49bc7228ed30f198fb44e549fc2e905f077bfbf84ed18b6ef8754f7): Error starting userland proxy: listen tcp4 0.0.0.0:3306: bind: address already in use"

yejiyang commented 1 year ago

Downgrading to 1.0.3 resolves the issue.

$Package = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $Package -AllUsers
Add-AppxPackage .\Microsoft.WSL_1.0.3.0_x64_ARM64.msixbundle

This works for me, great thanks!

jayg-hive commented 1 year ago

I tried to downgrade to 1.0.3 as well but I'm still not able to get the port forwarding to work. 😢

On Windows:

PS C:\Windows\System32> .\curl.exe -4 -v localhost:3000
*   Trying 127.0.0.1:3000...
* connect to 127.0.0.1 port 3000 failed: Connection refused
* Failed to connect to localhost port 3000 after 2050 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 3000 after 2050 ms: Connection refused
PS C:\Windows\System32> .\curl.exe -6 -v localhost:3000
*   Trying ::1:3000...
* connect to ::1 port 3000 failed: Connection refused
* Failed to connect to localhost port 3000 after 2026 ms: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 3000 after 2026 ms: Connection refused
PS C:\Windows\System32> wsl --version
WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22623.1180
PS C:\Windows\System32>

I did pretty much everything suggested (e.g., disabling the firewall on vEthernet, disabling Fast Startup) to no effect. This is also a Next.js app, like that of one of the earlier posters.

WslLogs-2023-01-25_11-55-50.zip

elsaco commented 1 year ago

@jayg-hive open a debug console by adding debugConsole = true to your .wslconfig file. After a wsl --shutdown you'll see a console opened when launching a distro. If you see messages like these:

GnsPortTracker: Requested the host for port allocation on port (family 2, port 3000, protocol 6) - returned 0
GnsPortTracker: Tracking bind call: family (2) port (3000) protocol (6)
SecCompDispatcher: Responding to notification with id 6934181779582174820 for pid 928, result 0
GnsPortTracker: No longer tracking bind call: family (2) port (3000) protocol (6)

you're affected by the port proxy issue.
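If the debug console output is saved to a text file, the "Tracking bind call" followed by "No longer tracking bind call" pattern can be searched for automatically. The sketch below assumes the message format is exactly as in the lines quoted above; `dropped_binds` is a hypothetical helper, and the regular expression is derived only from those sample lines. In these messages, family 2 is AF_INET (IPv4) and protocol 6 is TCP.

```python
import re

# Pattern built from the GnsPortTracker lines quoted in this thread.
BIND_RE = re.compile(
    r"GnsPortTracker: (Tracking|No longer tracking) bind call: "
    r"family \((\d+)\) port \((\d+)\) protocol \((\d+)\)"
)


def dropped_binds(log_text: str) -> list:
    """Return (family, port, protocol) tuples that were tracked, then untracked."""
    tracked = set()
    dropped = []
    for line in log_text.splitlines():
        match = BIND_RE.search(line)
        if not match:
            continue
        action = match.group(1)
        key = tuple(int(value) for value in match.group(2, 3, 4))
        if action == "Tracking":
            tracked.add(key)
        elif key in tracked:
            tracked.discard(key)
            dropped.append(key)
    return dropped
```

A port also stops being tracked when the service really does shut down, so a hit here is only a hint; it points to this bug when the service is still listening inside the distro.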

jayg-hive commented 1 year ago

@elsaco Unfortunately I'm getting a different one: WSL-Debug-Logs-2023-01-25.txt

[    8.741519] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    8.742121] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    8.742747] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    8.743611] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    8.744353] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    8.744952] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    8.745537] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[    9.140648] misc dxg: dxgk: dxgkio_query_adapter_info: Ioctl failed: -2
[   17.907860] TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised.
[   49.891549] hv_balloon: Max. dynamic memory size: 7894 MB

Not even a blip on the last line after I tried another curl request. 😭

jayg-hive commented 1 year ago

For anyone developing on WSL who just wants to see their next dev run, a workaround I've found is to follow the steps here to run Google Chrome on WSL, then run google-chrome. It's not as intuitive as having localhost access directly on Windows though. 🤷

ghost commented 1 year ago

@jayg-hive, if you're experiencing problems with port forwarding on 1.0.3, it is not the same problem as this bug. 1.1.0 tracks ports by a different means.

Dnouv commented 1 year ago

Hello @pmartincic

I can confirm this issue occurs with the latest 1.1.0.0. Here is the issue I faced:

  1. Tried to start the development server of Rocket.Chat, starts two services, one on 3001 and the other on 3000
  2. After a minute, the port 3001 process is stopped in the "Resource Monitor> Listening Ports", following which port 3000 also stops.
  3. However, they are accessible on wsl_ip:3000 and wsl_ip:3001.

I tried the following troubleshooting steps:

  1. Restart WSL (wsl --shutdown)
  2. Reset the WSL Ubuntu distribution, and installed a fresh one.
  3. Added localhostForwarding=true in .wslconfig
  4. Tried the netsh portproxy.

However, still no luck.

Any help would be really appreciated. Thank you!

WSL details:

WSL version: 1.1.0.0
Kernel version: 5.15.83.1

ghost commented 1 year ago

@Dnouv, I'm sorry you're running into this. For the time being you should follow the instructions above and revert to 1.0.3. I'm tracing the root cause of the bug and it will require a code change to fix.

Tofandel commented 1 year ago

Also had the same issue after trying the prerelease

Just adding a note that to run

Downgrading to 1.0.3 resolves the issue.

$Package = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers
Remove-AppxPackage $Package -AllUsers
Add-AppxPackage .\Microsoft.WSL_1.0.3.0_x64_ARM64.msixbundle

You need to download the 1.0.3 release and run that in an admin PowerShell from the folder where the file was downloaded.

astroboylrx commented 1 year ago

Also had this issue. I'm using jupyter notebook on WSL2 and have been accessing it via localhost:8888 from Windows without issues until this update.

With this update, I can sometimes access the notebook from a Windows browser, but it loses the connection after a short while, exactly matching what has been described in many cases above. Sometimes I cannot access it at all from the very beginning (I guess the "short while" was too short). This behavior is consistent regardless of Python version, port used (8888, 8999, etc.), and so on.

Also, jupyter sometimes reported that "address is already in use", but I cannot find any process in WSL2 occupying that address or port. I did notice many kubernetes entries seemingly listening (not sure) on the same port every time I start a Jupyter Notebook. The following results are from netstat -a in PowerShell, where only the first line is due to my jupyter; I have no idea about the other lines that show "ESTABLISHED" with a local address on 8891. If I kill jupyter, these lines change to "TIME_WAIT" in the State column, which is super confusing.

  Proto  Local Address          Foreign Address        State
  TCP    127.0.0.1:8891         VectorPro:0            LISTENING
  TCP    127.0.0.1:8891         kubernetes:51023       ESTABLISHED
  TCP    127.0.0.1:8891         kubernetes:51035       ESTABLISHED
  TCP    127.0.0.1:8891         kubernetes:51036       ESTABLISHED
  TCP    127.0.0.1:8891         kubernetes:51037       ESTABLISHED
  TCP    127.0.0.1:8891         kubernetes:51038       ESTABLISHED
  TCP    127.0.0.1:8891         kubernetes:51039       ESTABLISHED
  TCP    127.0.0.1:8891         kubernetes:51086       ESTABLISHED
  TCP    127.0.0.1:9656         VectorPro:0            LISTENING

After downgrading to 1.0.3, I don't see kubernetes anymore under the port of my jupyter notebook.

elsaco commented 1 year ago

@astroboylrx when you kill jupyter, your host keeps the TCP endpoint(s) around for a while in case late packet(s) arrive. They close after about 2 min, or whatever the MSL is set to. During this time they are in the TIME_WAIT state. Run netstat -a after 4-5 min and those connections will be gone.
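To follow connections as they move from ESTABLISHED to TIME_WAIT and then disappear, the netstat output can be summarised per port. This is a minimal sketch that assumes the four-column layout shown in the netstat listing in this thread (Proto, Local Address, Foreign Address, State); `states_for_port` is a hypothetical helper name.

```python
from collections import Counter


def states_for_port(netstat_text: str, port: int) -> Counter:
    """Count TCP states for rows whose local address ends with :port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected row shape: Proto, Local Address, Foreign Address, State.
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        if fields[1].rsplit(":", 1)[-1] == str(port):
            counts[fields[3]] += 1
    return counts
```

Feeding it the saved output of `netstat -a` at a few points in time shows the TIME_WAIT count for the port dropping to zero once the delay has passed.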

pedrolamas commented 1 year ago

Yesterday I manually removed WSL 1.1.0 and reinstalled 1.0.3 to fix this issue (which it did!), but this morning I turned on my machine and it is back to 1.1.0...

Is there a way I can stop Windows Store from updating WSL to 1.1.0?

Tofandel commented 1 year ago

Hmm, I don't know what's up, but the downgrade to 1.0.3 made it worse: now I can't access WSL at all on localhost or 127.0.0.1 and always get connection refused. In 1.1.0 I could at least connect for a few seconds.

I also tried 1.0.1 with no luck; it must have changed some system settings or something.

Edit: OK, I found the issue. I had previously added this to .wslconfig to see if it would fix the issue (which it didn't):

[wsl2]
kernelCommandLine=ipv6.disable=1

And removing it made 1.0.3 work again
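Since this leftover setting tripped up more than one person in the thread, a quick check of the .wslconfig contents may help. The sketch below assumes the file is INI-style with a `[wsl2]` section, as in the snippet above; `ipv6_disabled` is a hypothetical helper and simply looks for the `ipv6.disable=1` token in `kernelCommandLine`.

```python
import configparser


def ipv6_disabled(wslconfig_text: str) -> bool:
    """Return True if [wsl2] kernelCommandLine contains ipv6.disable=1."""
    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(wslconfig_text)
    # Option names are matched case-insensitively; the value is everything
    # after the first '=' on the line.
    cmdline = parser.get("wsl2", "kernelCommandLine", fallback="")
    return "ipv6.disable=1" in cmdline.split()
```

On Windows, the text would typically be read from `.wslconfig` in the user profile folder, for example with `pathlib.Path.home() / ".wslconfig"`; that location is stated here as an assumption about a standard setup.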

Cremesis commented 1 year ago

Yeah same problem here, I had to revert back to 1.0.3

ghost commented 1 year ago

Fix is currently in testing.

nickchomey commented 1 year ago

I wasted like 30 hours trying to figure out what was suddenly going wrong with my locally developed webapp, until I found this issue. I rolled back to 1.0.3 and it works again. A few things to perhaps note:

Note for WSL developers

If you need another data point to test with, I am using the Devilbox local development environment. You should be able to install and deploy it quite quickly with Docker, and you should see the issue replicate even if you don't create a vhost webapp: even the dashboard at http://localhost stops working after a few minutes.

https://github.com/cytopia/devilbox

Here's an issue I opened there with a lot of details, screen recording etc... if you want to take a look.

https://github.com/cytopia/devilbox/issues/954

ghost commented 1 year ago

@nickchomey , I completely get wanting to back up your distros. However, re-importing is unnecessary unless the packaging framework borks your distro.

nickchomey commented 1 year ago

Perhaps it isn't necessary for someone with WSL installed via the Microsoft Store or directly from the downloads in this repo. However, given that I was removing everything (WSL and Ubuntu) that was installed via wsl --install - in order to move to the direct install from this repo - I figured it was very likely that the distros would get borked.

Anyway, I look forward to whatever fix will get released for this

realmrv commented 1 year ago

As a workaround that avoids downgrading, use the IP address of the WSL eth0 virtual adapter instead of localhost. To get it, run ifconfig inside WSL, or run wsl hostname -I in a Windows terminal and pick the first address in the resulting list.
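The "pick the first address" step can be scripted. The following is a small sketch, assuming `wsl hostname -I` prints a space-separated list of addresses as described above; `first_address` and `wsl_address` are hypothetical helper names.

```python
import subprocess


def first_address(hostname_output: str) -> str:
    """Return the first address from `hostname -I` style output."""
    addresses = hostname_output.split()
    if not addresses:
        raise ValueError("no addresses in hostname -I output")
    return addresses[0]


def wsl_address() -> str:
    """Run `wsl hostname -I` from Windows and return the first address."""
    result = subprocess.run(
        ["wsl", "hostname", "-I"], capture_output=True, text=True, check=True
    )
    return first_address(result.stdout)
```

A dev server on port 3000 would then be opened at `http://<address>:3000` instead of `http://localhost:3000`. The address can change between WSL restarts, so it is looked up on each run rather than hard-coded.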

clmnin commented 1 year ago

I had installed WSL2 using wsl --install and I followed the steps shared by @rudyzeinoun

$Package = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers 
Remove-AppxPackage $Package -AllUsers
Add-AppxPackage .\Microsoft.WSL_1.0.3.0_x64_ARM64.msixbundl

Downgrading to 1.0.3 works.

bplasmeijer commented 1 year ago

It also solved my problem, downgrading to 1.0.3

Major  Minor  Build  Revision
-----  -----  -----  --------
10     0      22621  0

This issue should be solved soon; I spent hours debugging because I suspected VPN or network-related issues.

1.0.3: WslLogs-2023-01-30_08-57-55.zip
1.1.0: WslLogs-2023-01-30_09-06-46.zip

A simple test with kind:

kind create cluster --name test

On 1.1.0:

The connection to the server 127.0.0.1:40715 was refused - did you specify the right host or port?

On 1.0.3:

k get pods -A
NAMESPACE            NAME                                        READY   STATUS    RESTARTS      AGE
kube-system          coredns-565d847f94-hrcfs                    0/1     Running   5 (23s ago)   35m
kube-system          coredns-565d847f94-tfjwp                    0/1     Running   5 (23s ago)   35m
kube-system          etcd-abc-control-plane                      1/1     Running   5 (23s ago)   35m
kube-system          kindnet-txnq5                               1/1     Running   5 (23s ago)   35m
kube-system          kube-apiserver-abc-control-plane            1/1     Running   5 (23s ago)   35m
kube-system          kube-controller-manager-abc-control-plane   1/1     Running   5 (23s ago)   35m
kube-system          kube-proxy-ckkp2                            1/1     Running   5 (23s ago)   35m
kube-system          kube-scheduler-abc-control-plane            1/1     Running   5 (23s ago)   35m
local-path-storage   local-path-provisioner-684f458cdd-zzqfg     1/1     Running   7 (23s ago)   35m

cc: @craigloewen-msft @bitcrazed @benhillis

feedback hub: https://aka.ms/AAjeywc

bplasmeijer commented 1 year ago

@craigloewen-msft @benhillis can we please label this as a bug?

ghost commented 1 year ago

@bplasmeijer, it's actively being worked on and treated as a bug. Unfortunately we don't make religious use of the labels. Yesterday in testing, another issue was found with the release that was about to go out the door to fix this issue. That's been addressed.

bplasmeijer commented 1 year ago

Thanks 🙏 WSL team and @pmartincic. Labels can give visibility to WSL consumers. Any insights on the bug would be appreciated. I spent many hours trying to find the issue. 😬

Cc: @craigloewen-msft

nuttakit commented 1 year ago

I downgraded to 1.0.3 and it worked fine yesterday. It broke again today.

wsl --version
WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.1194

I found out that with 1.0.3 I have to remove kernelCommandLine=ipv6.disable=1 as well to make it work.

ghost commented 1 year ago

Sorry to hear that, @nuttakit. If you're on 1.0.3 or earlier, you are experiencing a different issue.
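Because the diagnosis depends on which WSL version is installed, it can help to check the `wsl --version` text before assuming this bug applies. This is only a sketch: it assumes English output in the format pasted in this thread, takes already-decoded text as input, and treats only 1.1.0 as affected because that is the only version reported as such here. Both function names are hypothetical.

```python
import re


def wsl_version(version_output: str):
    """Extract the WSL version tuple from `wsl --version` text, or None."""
    # "WSLg version:" does not match because of the extra "g".
    match = re.search(r"WSL version:\s*(\d+(?:\.\d+)*)", version_output)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def reported_affected(version_output: str) -> bool:
    """True only for 1.1.0.x, the version this thread reports as broken."""
    version = wsl_version(version_output)
    return version is not None and version[:3] == (1, 1, 0)
```

Anything that returns `False` here but still fails to forward ports would, per the maintainer's comments, be a separate problem from the one tracked in this issue.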

codebymikey commented 1 year ago

Can someone please confirm if there's any potential for loss of data when attempting the downgrade back to 1.0.3 from a previous wsl --update?

$Package = Get-AppxPackage MicrosoftCorporationII.WindowsSubsystemforLinux -AllUsers 
Remove-AppxPackage $Package -AllUsers
Add-AppxPackage .\Microsoft.WSL_1.0.3.0_x64_ARM64.msixbundle

The Remove-AppxPackage command looks particularly scary since I'm not sure if the WSL mount data of individual Distribution packages is somehow coupled with the WSL package (it doesn't appear so, but better safe than sorry!).

ln8711 commented 1 year ago

@codebymikey your data is stored in a .vhdx file (if you are using WSL 2), so this can't affect the data. But better safe than sorry: back up first.

nickchomey commented 1 year ago

You should never assume that data is safe. As I said in my comment somewhere above, export your wsl distros and then do the downgrade. You can import them afterwards.

Cremesis commented 1 year ago

I see a new release is available on the Microsoft Store (1.1.2) and it looks like the bug is fixed.