microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License
17.07k stars 797 forks source link

Socket Connect hangs until timeout runs out in mirrored mode. #10855

Open MichiBab opened 8 months ago

MichiBab commented 8 months ago

Windows Version

Microsoft Windows [Version 10.0.22631.2715]

WSL Version

2.0.9.0 and 2.0.14 tested

Are you using WSL 1 or WSL 2?

Kernel Version

5.15.133.1-1

Distro Version

Ubuntu 22.04

Other Software

No response

Repro Steps

Call this script in mirrored mode as normal user or sudo (try both):

import socket
import time
import os

def is_port_in_use(port):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # s.settimeout(2)  # set timeout directly

        print(f"Socket type: {s.type}")
        print(f"Socket family: {s.family}")
        print(f"Socket timeout: {s.gettimeout()} seconds")
        # Get and print socket options
        print(f"SO_REUSEADDR option: {s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)}")
        print(f"SO_KEEPALIVE option: {s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)}")
        print(f"TCP_NODELAY option: {s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)}")
        print("checking: ", port)
        print("Timeout: ",s.gettimeout())
        return s.connect_ex(("localhost", port)) == 0

print("Uid: ", os.getuid())
start_time = time.time()
print("Is Port in use:", is_port_in_use(6006))
print("--- Runtime: %s seconds ---" % (time.time() - start_time))

Expected Behavior

It directly checks if the socket is in use or not.

Actual Behavior

Sometimes it takes over 132 seconds to return True or False. In not mirrored mode it always directly returns. I came across this issue using ros, where the roscore takes about 2 minutes to start, or starting tensorboard takes around 2 minutes aswell, so i build this script to reproduce the error.

If it directly returns true or false in mirrored mode, i start the script as sudo and then the problem occurs again. What is interesting, is that if i set the socket timeout to a specific second, then it will return after that. Its like its blocked for the default timeout.

Diagnostic Logs

Without networking mode mirrored:

❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: None seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: None Is Port in use: False --- Runtime: 0.0001404285430908203 seconds ---

With networking mode mirrored:

❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: None seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: None Is Port in use: False --- Runtime: 132.54164338111877 seconds ---

Sometimes it works as fast as without mirrored. But sometimes not, even tho it results in the port is not in use. If it works directly, starting the script with sudo will result again in this timeout.

With networking mode mirrored and setting timeout to a second with: s.settimeout(1) before the connect_ex call

❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: 1.0 seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: None Is Port in use: False --- Runtime: 1.001239538192749 seconds ---

Or with settimeout to 2 seconds:

❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: 2.0 seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Timeout: 2.0 Is Port in use: False --- Runtime: 2.0022153854370117 seconds ---

github-actions[bot] commented 8 months ago

Hi I'm an AI powered bot that finds similar issues based off the issue title.

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it. Thank you!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

chanpreetdhanjal commented 7 months ago

Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

MichiBab commented 7 months ago

Hi, i collected them with a 10 sec forced timeout, as described in my ticket. ❯ python3 test_connect.py Uid: 1000 Socket type: 1 Socket family: 2 Socket timeout: 10.0 seconds SO_REUSEADDR option: 0 SO_KEEPALIVE option: 0 TCP_NODELAY option: 0 checking: 6006 Is Port in use: False --- Runtime: 10.010206699371338 seconds --- https://github.com/microsoft/WSL/files/13559172/WslLogs-2023-12-05_12-02-48.zip

chanpreetdhanjal commented 7 months ago

Please follow the networking diagnostic script. https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

It should create a folder like [WslNetworkingLogs-date_ver.zip]

MichiBab commented 7 months ago

WslNetworkingLogs-2023-12-07_21-14-32.zip Updated, here are the new logs @chanpreetdhanjal .

markovaljaots commented 7 months ago

I have the same issue with WSL 2.0.14.0

MichiBab commented 6 months ago

Still the same problem with 2.1.0

derMatze82 commented 6 months ago

same problem here with 2.1.0. It is frustrating because some applications expect a conneciton refused when performing port checks with telnet

MichiBab commented 6 months ago

Still the same problem with 2.1.1

linG5821 commented 6 months ago

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6

netsh interface ipv6 show prefixpolicies

Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities
PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast

But I don't know what other problems such a change would cause

It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

zed76r commented 6 months ago

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6

netsh interface ipv6 show prefixpolicies

Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast

But I don't know what other problems such a change would cause

It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

not work for me.

image

linG5821 commented 6 months ago

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6 netsh interface ipv6 show prefixpolicies Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast But I don't know what other problems such a change would cause It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

not work for me.

image

Are you using powershell or cmd on windows ?

On WSL use python open a http server image

image

zed76r commented 6 months ago

@linG5821 We have a different issue. The problem is that the port is not listening, and it not refusing connections immediately.

linG5821 commented 6 months ago

I did find problems with changing the default NIC policy in practice

I mapped ports when using Docker, I tried '127.0.0.1:port' and 'localhost:port' both were accessible, but I used the Intranet ip 'ipv4ip:port' and couldn't access it, when I reset the network, Intranet ip 'ipv4ip:port' and '127.0.0.1:port' are accessible, and access to 'localhost:port' comes back to the issue itself.

MichiBab commented 5 months ago

Still the same issue with 2.1.3.0

MichiBab commented 4 months ago

Still the same issue on 2.1.5. @chanpreetdhanjal is there any way to help? I have a workaround for ROS but it's a bit annoying on tensorboard to wait that long after fetching plots.

chanpreetdhanjal commented 4 months ago

Someone from our team is looking at it. We will get back to you as soon as we make further progress. Thank you for your patient.

zed76r commented 2 months ago

Can't believe MS team still not resolve this bug. It's pretty annoying problem, especially when I am going to using the vscode JS debug terminal in WSL2 mirrored network mode.

chanpreetdhanjal commented 2 months ago

We are actively looking at it as per other items in our priority list. Will provide an update as soon as we have something tangible to share.

keith-horton commented 2 months ago

To update this thread: there are 2 possible issues.

  1. when attempting to connect by name "localhost", the IPv6 address ::1 is likely resolved first (v4 vs. v6 is configurable in Linux). Linux does not support the option for IPv6 that we use to allow sending localhost traffic to the host - it's only supported with IPv4. So unfortunately we cannot use the :;1 address to connect from the Linux container to the host.

  2. when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.

Sorry, the RST issue is taking much longer to root cause than expected :(

Hope this helps

dhtzs commented 1 month ago

Has there been any progress on this issue?

keith-horton commented 1 month ago

we have spent a lot of cycles debugging down the windows stack, through vswitch, to the vm - this has been really difficult and we don't yet have a culprit. we are fixing some Docker related issues - once that goes out, we'll return back to finding this bug.

shigenobuokamoto commented 1 month ago

when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.

indeed, when capturing packets on Windows and Linux (WSL), the port is wrong.

Windows host                                  linux(WSL)
       HV interface     FSE switch         loopback0
syn    222 <- 60844                         222 <- 60844
         00:11:22:33:44:55 <-------------- 00:15:5d:bd:df:a8

rst    222 -> 60844                         222 -> 45634
                                                   ^^^^^???
00:15:5d:61:13:45  ----------------------> 00:15:5d:bd:df:a8

destination MAC address Windows sent the packet to is loopback0 (WSL). MAC address received by Linux (WSL) is the one sent by Windows.

if both perceptions are correct, then maybe something is going on inside the FSE switch?

hostAddressLoopback=true & eth0 address will back to destination port Windows sent it to, which is fine. i think it is basically the same mechanism, but the behavior is different.

drolander commented 1 week ago

I'm having the same issue. Just got upgraded to Win11 22H2 and have been waiting to try mirrored networking. It did not take long to find this issue. Please fix.

Using WSL 2.2.4.0

lihk11 commented 1 week ago

same issue with 2.2.4.0 and the Intranet ip 'ipv4ip:port' and couldn't access it