microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

WSL network crashes after 100 ssh connections #11759

Open DomRakowski opened 1 month ago

DomRakowski commented 1 month ago

Windows Version

Microsoft Windows [version 10.0.19045.4529]

WSL Version

2.2.4.0

Are you using WSL 1 or WSL 2?

Kernel Version

5.15.153.1-2

Distro Version

Ubuntu 22.04

Other Software

No response

Repro Steps

When I launch this script, it stops at the 100th iteration and the WSL network breaks afterwards:

#!/bin/bash

for (( index=0; index < 200; index++ )); do
    echo "$index"
    ssh -q root@10.0.204.230 "docker images" | grep postgres | tr -s ' ' | cut -d' ' -f 2
done

After the 100th iteration, I can't ssh into 10.0.204.230, and after a few minutes ping can't reach the target either.
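To pin down exactly which iteration stalls and for how long, the loop above could be wrapped so that each run reports its exit status and duration. This is only a sketch: `probe_loop` is a hypothetical helper name, and the `ConnectTimeout` and `BatchMode` values in the example invocation are assumptions chosen so that a hang shows up as a failed iteration instead of blocking forever.

```shell
# Hypothetical helper: run a command N times and print, for each run,
# the iteration index, the command's exit status, and the elapsed seconds.
# The command's own output is discarded and its stdin is closed.
probe_loop() {
    probe_count=$1
    shift
    probe_index=0
    while [ "$probe_index" -lt "$probe_count" ]; do
        probe_start=$(date +%s)
        "$@" < /dev/null > /dev/null 2>&1
        probe_status=$?
        probe_end=$(date +%s)
        echo "$probe_index status=$probe_status elapsed=$(( probe_end - probe_start ))s"
        probe_index=$(( probe_index + 1 ))
    done
}

# Example invocation (target address taken from the report above):
# probe_loop 200 ssh -q -o BatchMode=yes -o ConnectTimeout=10 root@10.0.204.230 "docker images"
```

A non-zero status or a jump in elapsed time around iteration 100 would confirm where the connection starts failing.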

Expected Behavior

The script should run to the end and the WSL network should keep working.

Actual Behavior

The WSL network crashes: I can't ssh into the target, and after 5 minutes I can't ping it either. I can still ping the IP address of the local Windows machine.

Diagnostic Logs

No response

github-actions[bot] commented 1 month ago

Logs are required for review from WSL team

If this is a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise please attach logs by following the instructions below; your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs

Download and execute [collect-wsl-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-wsl-logs.ps1) in an **administrative powershell prompt**:

```
Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1
```

The script will output the path of the log file once done.

Once completed please upload the output files to this Github issue.

[Click here for more info on logging](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#8-collect-wsl-logs-recommended-method)

If you choose to email these logs instead of attaching to the bug, please send them to wsl-gh-logs@microsoft.com with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

DomRakowski commented 1 month ago

WslLogs-2024-07-09_14-57-24.zip

github-actions[bot] commented 1 month ago

Diagnostic information

```
Detected appx version: 2.2.4.0
```

OneBlue commented 1 month ago

This is interesting. Can you detail what you mean by "network breaks"? Once you get into this state, can you share the output of something like curl -v microsoft.com, to see what the symptoms look like?

DomRakowski commented 1 month ago

By "network breaks" I mean that I can't reach the target machine. I can only ping the WSL IP address and the IP address of my host machine's NIC.

I tried the same script on an Ubuntu 24.04 VM and it works fine: it completes every SSH session.

So it's definitely an issue with WSL.
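The symptom described here (WSL and host addresses answer, the remote target does not) can be captured in one pass with a small reachability check. This is a minimal sketch: `check_reachable` is a hypothetical helper name, and the `-c 1 -W 2` options assume the Linux iputils `ping` that ships with Ubuntu (one probe, two-second wait).

```shell
# Hypothetical helper: ping each address once and report whether it answered.
# Assumes Linux iputils ping: -c 1 sends one probe, -W 2 waits up to 2 seconds.
check_reachable() {
    for addr in "$@"; do
        if ping -c 1 -W 2 "$addr" > /dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr unreachable"
        fi
    done
}

# Example invocation once the network is in the broken state
# (only the first address comes from the report; add the WSL and host addresses):
# check_reachable 10.0.204.230
```

Running it before and after the script stalls gives a side-by-side record of which hops stop responding.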

zcobol commented 1 month ago

@DomRakowski where is your postgres container running, on the same machine or remotely? Using WSL Ubuntu Noble and a Docker container running on a remote RHEL machine, all 200 connections executed successfully:

test output

```
zcobol@toto:~$ ./wsl11759bug.sh
0 latest 1 latest 2 latest 3 latest 4 latest 5 latest 6 latest 7 latest 8 latest 9 latest 10 latest 11 latest 12 latest 13 latest 14 latest 15 latest 16 latest 17 latest 18 latest 19 latest 20 latest 21 latest 22 latest 23 latest 24 latest 25 latest 26 latest 27 latest 28 latest 29 latest 30 latest 31 latest 32 latest 33 latest 34 latest 35 latest 36 latest 37 latest 38 latest 39 latest 40 latest 41 latest 42 latest 43 latest 44 latest 45 latest 46 latest 47 latest 48 latest 49 latest 50 latest 51 latest 52 latest 53 latest 54 latest 55 latest 56 latest 57 latest 58 latest 59 latest 60 latest 61 latest 62 latest 63 latest 64 latest 65 latest 66 latest 67 latest 68 latest 69 latest 70 latest 71 latest 72 latest 73 latest 74 latest 75 latest 76 latest 77 latest 78 latest 79 latest 80 latest 81 latest 82 latest 83 latest 84 latest 85 latest 86 latest 87 latest 88 latest 89 latest 90 latest 91 latest 92 latest 93 latest 94 latest 95 latest 96 latest 97 latest 98 latest 99 latest 100 latest 101 latest 102 latest 103 latest 104 latest 105 latest 106 latest 107 latest 108 latest 109 latest 110 latest 111 latest 112 latest 113 latest 114 latest 115 latest 116 latest 117 latest 118 latest 119 latest 120 latest 121 latest 122 latest 123 latest 124
---cut---
```

WSL info:

WSL version: 2.2.4.0
Kernel version: 5.15.153.1-2
WSLg version: 1.0.61
MSRDC version: 1.2.5326
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.26091.1-240325-1447.ge-release
Windows version: 10.0.19045.4651

OneBlue commented 1 month ago

@DomRakowski: Once you get into this broken state, can you share the output of: ssh -v <target> echo ok ?

DomRakowski commented 1 month ago

@zcobol After upgrading from Ubuntu 22.04 to Ubuntu 24.04, the script hangs after 100 SSH connections, but WSL somehow recovers and runs another 100 SSH connections before hanging again for a while. This pattern then repeats:

Can you confirm whether this also happens on your side?

My postgres container is running on a separate machine.
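One way to see what changes around each batch of 100 connections is to summarize the TCP socket states inside WSL while the script runs. This is a sketch only, and nothing in this thread confirms that socket state is involved: `count_tcp_states` is a hypothetical helper name, and it assumes the default `ss -tan` layout from iproute2, where the first line is a header and the first column is the state.

```shell
# Hypothetical helper: read `ss -tan` output on stdin and print one line per
# TCP state with the number of sockets in that state, sorted by state name.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Example invocation, run in a second terminal while the loop is executing:
# ss -tan | count_tcp_states
```

Comparing the summary taken just before a hang with one taken during it would show whether any state count grows with the number of completed connections.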

DomRakowski commented 1 month ago

@OneBlue After upgrading Ubuntu to 24.04, the problem is less severe than before but still present. While the script hangs, it seems I can still ssh into the host; I will do some further tests.

An interesting point: the behavior I mentioned (100 SSH connections, hang, 100 SSH connections, hang, ...) also occurs on another machine, but the hang there is shorter than on mine.

This machine is a NUC with the following hardware configuration :

The machine that I work on is a laptop with this hardware configuration :

I'm pretty sure this is somehow related in the end to hardware utilization, but it shouldn't straight up break the network.