Open MichiBab opened 8 months ago
Hi I'm an AI powered bot that finds similar issues based off the issue title.
Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it. Thank you!
Note: You can give me feedback by thumbs upping or thumbs downing this comment.
Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues
Hi, i collected them with a 10 sec forced timeout, as described in my ticket. ❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: 10.0 seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Is Port in use: False --- Runtime: 10.010206699371338 seconds --- https://github.com/microsoft/WSL/files/13559172/WslLogs-2023-12-05_12-02-48.zip
Please follow the networking diagnostic script. https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues
It should create a folder like [WslNetworkingLogs-date_ver.zip]
WslNetworkingLogs-2023-12-07_21-14-32.zip Updated, here are the new logs @chanpreetdhanjal .
I have the same issue with WSL 2.0.14.0
Still the same problem with 2.1.0
same problem here with 2.1.0. It is frustrating because some applications expect a conneciton refused when performing port checks with telnet
Still the same problem with 2.1.1
When I go through windows telnet localhost [port]
the access is very slow, but when I go through telnet 127.0.0.1 [port]
or telnet native ip [port]
it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6
netsh interface ipv6 show prefixpolicies
Default policy
Priority label prefix
---------- ----- --------------------------------
50 0 ::1/128
40 1 ::/0
35 4 ::ffff:0:0/96
30 2 2002::/16
5 5 2001::/32
3 13 fc00::/7
1 11 fec0::/10
1 12 3ffe::/16
1 3 ::/96
I've changed the order of their priorities
PS: required Administrator
netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12
Modified policy
Priority label prefix
---------- ----- --------------------------------
50 0 ::/96
40 1 ::ffff:0:0/96
35 2 2002::/16
30 3 2001::/32
10 4 ::1/128
5 5 ::/0
3 13 fc00::/7
1 12 3ffe::/16
1 11 fec0::/10
now telnet 127.0.0.1 [port]
is getting fast
But I don't know what other problems such a change would cause
It can be reset by the following command
netsh interface ipv6 reset
PS: reboot your system
When I go through windows
telnet localhost [port]
the access is very slow, but when I go throughtelnet 127.0.0.1 [port]
ortelnet native ip [port]
it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6
netsh interface ipv6 show prefixpolicies
Default policy
Priority label prefix ---------- ----- -------------------------------- 50 0 ::1/128 40 1 ::/0 35 4 ::ffff:0:0/96 30 2 2002::/16 5 5 2001::/32 3 13 fc00::/7 1 11 fec0::/10 1 12 3ffe::/16 1 3 ::/96
I've changed the order of their priorities PS: required Administrator
netsh int ipv6 set prefix ::/96 50 0 netsh int ipv6 set prefix ::ffff:0:0/96 40 1 netsh int ipv6 set prefix 2002::/16 35 2 netsh int ipv6 set prefix 2001::/32 30 3 netsh int ipv6 set prefix ::1/128 10 4 netsh int ipv6 set prefix ::/0 5 5 netsh int ipv6 set prefix fc00::/7 3 13 netsh int ipv6 set prefix fec0::/10 1 11 netsh int ipv6 set prefix 3ffe::/16 1 12
Modified policy
Priority label prefix ---------- ----- -------------------------------- 50 0 ::/96 40 1 ::ffff:0:0/96 35 2 2002::/16 30 3 2001::/32 10 4 ::1/128 5 5 ::/0 3 13 fc00::/7 1 12 3ffe::/16 1 11 fec0::/10
now
telnet 127.0.0.1 [port]
is getting fastBut I don't know what other problems such a change would cause
It can be reset by the following command
netsh interface ipv6 reset
PS: reboot your system
not work for me.
When I go through windows
telnet localhost [port]
the access is very slow, but when I go throughtelnet 127.0.0.1 [port]
ortelnet native ip [port]
it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6netsh interface ipv6 show prefixpolicies
Default policyPriority label prefix ---------- ----- -------------------------------- 50 0 ::1/128 40 1 ::/0 35 4 ::ffff:0:0/96 30 2 2002::/16 5 5 2001::/32 3 13 fc00::/7 1 11 fec0::/10 1 12 3ffe::/16 1 3 ::/96
I've changed the order of their priorities PS: required Administrator
netsh int ipv6 set prefix ::/96 50 0 netsh int ipv6 set prefix ::ffff:0:0/96 40 1 netsh int ipv6 set prefix 2002::/16 35 2 netsh int ipv6 set prefix 2001::/32 30 3 netsh int ipv6 set prefix ::1/128 10 4 netsh int ipv6 set prefix ::/0 5 5 netsh int ipv6 set prefix fc00::/7 3 13 netsh int ipv6 set prefix fec0::/10 1 11 netsh int ipv6 set prefix 3ffe::/16 1 12
Modified policy
Priority label prefix ---------- ----- -------------------------------- 50 0 ::/96 40 1 ::ffff:0:0/96 35 2 2002::/16 30 3 2001::/32 10 4 ::1/128 5 5 ::/0 3 13 fc00::/7 1 12 3ffe::/16 1 11 fec0::/10
now
telnet 127.0.0.1 [port]
is getting fast But I don't know what other problems such a change would cause It can be reset by the following commandnetsh interface ipv6 reset
PS: reboot your systemnot work for me.
Are you using powershell or cmd on windows ?
On WSL use python open a http server
@linG5821 We have a different issue. The problem is that the port is not listening, and it not refusing connections immediately.
I did find problems with changing the default NIC policy in practice
I mapped ports when using Docker, I tried '127.0.0.1:port' and 'localhost:port' both were accessible, but I used the Intranet ip 'ipv4ip:port' and couldn't access it, when I reset the network, Intranet ip 'ipv4ip:port' and '127.0.0.1:port' are accessible, and access to 'localhost:port' comes back to the issue itself.
Still the same issue with 2.1.3.0
Still the same issue on 2.1.5. @chanpreetdhanjal is there any way to help? I have a workaround for ROS but it's a bit annoying on tensorboard to wait that long after fetching plots.
Someone from our team is looking at it. We will get back to you as soon as we make further progress. Thank you for your patient.
Can't believe MS team still not resolve this bug. It's pretty annoying problem, especially when I am going to using the vscode JS debug terminal in WSL2 mirrored network mode.
We are actively looking at it as per other items in our priority list. Will provide an update as soon as we have something tangible to share.
To update this thread: there are 2 possible issues.
when attempting to connect by name "localhost", the IPv6 address ::1 is likely resolved first (v4 vs. v6 is configurable in Linux). Linux does not support the option for IPv6 that we use to allow sending localhost traffic to the host - it's only supported with IPv4. So unfortunately we cannot use the :;1 address to connect from the Linux container to the host.
when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.
Sorry, the RST issue is taking much longer to root cause than expected :(
Hope this helps
Has there been any progress on this issue?
we have spent a lot of cycles debugging down the windows stack, through vswitch, to the vm - this has been really difficult and we don't yet have a culprit. we are fixing some Docker related issues - once that goes out, we'll return back to finding this bug.
when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.
indeed, when capturing packets on Windows and Linux (WSL), the port is wrong.
Windows host linux(WSL)
HV interface FSE switch loopback0
syn 222 <- 60844 222 <- 60844
00:11:22:33:44:55 <-------------- 00:15:5d:bd:df:a8
rst 222 -> 60844 222 -> 45634
^^^^^???
00:15:5d:61:13:45 ----------------------> 00:15:5d:bd:df:a8
destination MAC address Windows sent the packet to is loopback0 (WSL). MAC address received by Linux (WSL) is the one sent by Windows.
if both perceptions are correct, then maybe something is going on inside the FSE switch?
hostAddressLoopback=true & eth0 address will back to destination port Windows sent it to, which is fine. i think it is basically the same mechanism, but the behavior is different.
I'm having the same issue. Just got upgraded to Win11 22H2 and have been waiting to try mirrored networking. It did not take long to find this issue. Please fix.
Using WSL 2.2.4.0
same issue with 2.2.4.0 and the Intranet ip 'ipv4ip:port' and couldn't access it
Windows Version
Microsoft Windows [Version 10.0.22631.2715]
WSL Version
2.0.9.0 and 2.0.14 tested
Are you using WSL 1 or WSL 2?
Kernel Version
5.15.133.1-1
Distro Version
Ubuntu 22.04
Other Software
No response
Repro Steps
Call this script in mirrored mode as normal user or sudo (try both):
Expected Behavior
It directly checks if the socket is in use or not.
Actual Behavior
Sometimes it takes over 132 seconds to return True or False. In not mirrored mode it always directly returns. I came across this issue using ros, where the roscore takes about 2 minutes to start, or starting tensorboard takes around 2 minutes aswell, so i build this script to reproduce the error.
If it directly returns true or false in mirrored mode, i start the script as sudo and then the problem occurs again. What is interesting, is that if i set the socket timeout to a specific second, then it will return after that. Its like its blocked for the default timeout.
Diagnostic Logs
Without networking mode mirrored:
❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: None seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: None Is Port in use: False --- Runtime: 0.0001404285430908203 seconds ---
With networking mode mirrored:
❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: None seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: None Is Port in use: False --- Runtime: 132.54164338111877 seconds ---
Sometimes it works as fast as without mirrored. But sometimes not, even tho it results in the port is not in use. If it works directly, starting the script with sudo will result again in this timeout.
With networking mode mirrored and setting timeout to a second with: s.settimeout(1) before the connect_ex call
❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: 1.0 seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: None Is Port in use: False --- Runtime: 1.001239538192749 seconds ---
Or with settimeout to 2 seconds:
❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: 2.0 seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: 2.0 Is Port in use: False --- Runtime: 2.0022153854370117 seconds ---