microsoft / WSL

Lost routing table once in a while after using mirrored networking #10588

Open leoleoasd opened 11 months ago

leoleoasd commented 11 months ago

Windows Version

Microsoft Windows [Version 10.0.22621.2361]

WSL Version

2.0.0.0

Are you using WSL 1 or WSL 2?

WSL 2

Kernel Version

5.15.123.1-1

Distro Version

Archlinux

Other Software

No response

Repro Steps

[wsl2]
memory=60GB
[experimental]
autoMemoryReclaim=gradual
sparseVhd=true
networkingMode=mirrored
dnsTunneling=true

Use WSL with the above .wslconfig for a while, then:

$ ip r
169.254.0.0/16 dev eth4 proto kernel scope link metric 261

After rebooting WSL:

$ ip r
default via 10.0.0.1 dev eth5 proto kernel metric 30
10.0.0.0/24 dev eth5 proto kernel scope link metric 286
10.0.0.1 dev eth5 proto kernel scope link metric 30
169.254.0.0/16 dev eth4 proto kernel scope link metric 261

Expected Behavior

$ ip r
default via 10.0.0.1 dev eth5 proto kernel metric 30
10.0.0.0/24 dev eth5 proto kernel scope link metric 286
10.0.0.1 dev eth5 proto kernel scope link metric 30
169.254.0.0/16 dev eth4 proto kernel scope link metric 261

Actual Behavior

$ ip r
169.254.0.0/16 dev eth4 proto kernel scope link metric 261

Diagnostic Logs

No response

keith-horton commented 11 months ago

Hi there. Can you get us a trace to see what happened? https://github.com/microsoft/WSL/blob/master/diagnostics/wsl_networking.wprp

Thanks!

leoleoasd commented 11 months ago

How can I run it? Is there a 'readme'?

leoleoasd commented 11 months ago

Should I wait for it to happen again and then run the trace, or should I run the trace beforehand and wait for it to happen?

keith-horton commented 11 months ago

Sorry, that file (*.wprp) is used with WPR.exe. But I think there's a better path to take, since the repro might take a long time. I tested this, and it will capture only the WSL-related traces to keep the file size to a minimum.

With no wsl instance running (wsl --shutdown), please run the following from an admin cmd shell:

logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets

After you see the repro, please run the following to stop the trace: logman stop wsl_trace -ets

You might have to run it a couple of times; logman can sometimes fail the first time or two. For example:

C:\>logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets
The command completed successfully.

<<<<<<<< Now Repro >>>>>>>>

C:\>logman stop wsl-trace -ets
Error: Data Collector Set was not found.

C:\>logman stop wsl_trace -ets
The command completed successfully.

C:\>dir *.etl
 Volume in drive C has no label.
 Volume Serial Number is C64F-A1F6

 Directory of C:\

10/04/2023  07:53 PM           368,640 wsl_trace.etl
               1 File(s)        368,640 bytes
               0 Dir(s)  34,756,386,816 bytes free

Please send back the generated ETL file.

Thanks!

leoleoasd commented 11 months ago

What information will it contain? Will there be any sensitive / personal information? Can I send it to your email instead of uploading it here publicly?

keith-horton commented 11 months ago

It will contain IP addresses and possibly machine names / DNS names. You can send it to me directly if you would prefer (khorton@microsoft.com).

snjnz commented 11 months ago

I'm seeing something similar to this and #10587.

What I'm noticing is that something is triggering the deletion of the routes, which I've confirmed with sudo rtmon file iproute.log. Reading the log back, I see the following (public prefix removed; a capture/replay sketch follows below the log):

Timestamp: Sat Oct  7 17:28:16 2023 440185 us
172.20.127.1 dev eth0 lladdr 4c:ed:fb:45:dd:af PROBE
Timestamp: Sat Oct  7 17:28:16 2023 440549 us
172.20.127.1 dev eth0 lladdr 4c:ed:fb:45:dd:af REACHABLE
Timestamp: Sat Oct  7 17:28:26 2023 195941 us
Deleted 2: eth0    inet 172.20.127.102/24 brd 172.20.127.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
Timestamp: Sat Oct  7 17:28:26 2023 195961 us
Deleted broadcast 172.20.127.255 dev eth0 table local proto kernel scope link src 172.20.127.102
Timestamp: Sat Oct  7 17:28:26 2023 195962 us
Deleted local 172.20.127.102 dev eth0 table local proto kernel scope host src 172.20.127.102
Timestamp: Sat Oct  7 17:28:26 2023 195963 us
Deleted 172.20.127.254 dev eth0 lladdr b8:69:f4:3a:59:3a REACHABLE
Timestamp: Sat Oct  7 17:28:26 2023 195965 us
Deleted 172.20.127.1 dev eth0 lladdr 4c:ed:fb:45:dd:af REACHABLE
Timestamp: Sat Oct  7 17:28:26 2023 196000 us
2: eth0    inet 172.20.127.102/24 brd 172.20.127.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
Timestamp: Sat Oct  7 17:28:26 2023 196003 us
local 172.20.127.102 dev eth0 table local proto kernel scope host src 172.20.127.102
Timestamp: Sat Oct  7 17:28:26 2023 196004 us
broadcast 172.20.127.255 dev eth0 table local proto kernel scope link src 172.20.127.102
Timestamp: Sat Oct  7 17:28:26 2023 198272 us
Deleted 2: eth0    inet6 fe80::c6f:6f3f:90cb:28b0/64 scope link nodad noprefixroute
       valid_lft forever preferred_lft forever
Timestamp: Sat Oct  7 17:28:26 2023 198282 us
Deleted local fe80::c6f:6f3f:90cb:28b0 dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:28:26 2023 200308 us
Deleted 2: eth0    inet6 [prefix]:1dd6:545c:c659:49db/64 scope global nodad deprecated noprefixroute
       valid_lft forever preferred_lft 0sec
Timestamp: Sat Oct  7 17:28:26 2023 200318 us
Deleted local [prefix]:1dd6:545c:c659:49db dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:28:26 2023 201847 us
2: eth0    inet6 fe80::c6f:6f3f:90cb:28b0/64 scope link nodad noprefixroute
       valid_lft forever preferred_lft forever
Timestamp: Sat Oct  7 17:28:26 2023 201858 us
local fe80::c6f:6f3f:90cb:28b0 dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:28:26 2023 202009 us
2: eth0    inet6 [prefix]:1dd6:545c:c659:49db/64 scope global nodad deprecated noprefixroute
       valid_lft forever preferred_lft 0sec
Timestamp: Sat Oct  7 17:28:26 2023 202017 us
local [prefix]:1dd6:545c:c659:49db dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:28:26 2023 203488 us
2: eth0    inet6 fddf:9aa6:8476:4640:1c8a:872d:62fa:b502/128 scope global nodad noprefixroute
       valid_lft forever preferred_lft forever
Timestamp: Sat Oct  7 17:28:26 2023 203502 us
local fddf:9aa6:8476:4640:1c8a:872d:62fa:b502 dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:28:26 2023 204435 us
2: eth0    inet6 fddf:9aa6:8476:4640:8321:bcd7:fa1e:1925/64 scope global nodad deprecated noprefixroute
       valid_lft forever preferred_lft 0sec
Timestamp: Sat Oct  7 17:28:26 2023 204444 us
local fddf:9aa6:8476:4640:8321:bcd7:fa1e:1925 dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:28:36 2023 360230 us
fe80::ba69:f4ff:fe3a:593a dev eth0 lladdr b8:69:f4:3a:59:3a router PROBE
Timestamp: Sat Oct  7 17:28:36 2023 360683 us
fe80::ba69:f4ff:fe3a:593a dev eth0 lladdr b8:69:f4:3a:59:3a router REACHABLE
Timestamp: Sat Oct  7 17:29:06 2023 920178 us
fe80::ba69:f4ff:fe3a:593a dev eth0 lladdr b8:69:f4:3a:59:3a router STALE
Timestamp: Sat Oct  7 17:29:39 2023 184088 us
Deleted 2: eth0    inet6 fddf:9aa6:8476:4640:1c8a:872d:62fa:b502/128 scope global nodad noprefixroute
       valid_lft forever preferred_lft forever
Timestamp: Sat Oct  7 17:29:39 2023 184109 us
Deleted local fddf:9aa6:8476:4640:1c8a:872d:62fa:b502 dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:29:39 2023 185185 us
Deleted 2: eth0    inet6 fddf:9aa6:8476:4640:8321:bcd7:fa1e:1925/64 scope global nodad deprecated noprefixroute
       valid_lft forever preferred_lft 0sec
Timestamp: Sat Oct  7 17:29:39 2023 185192 us
Deleted local fddf:9aa6:8476:4640:8321:bcd7:fa1e:1925 dev eth0 proto kernel metric 0 pref medium
Timestamp: Sat Oct  7 17:29:49 2023 240283 us
fe80::ba69:f4ff:fe3a:593a dev eth0 lladdr b8:69:f4:3a:59:3a router PROBE
Timestamp: Sat Oct  7 17:29:49 2023 240627 us
fe80::ba69:f4ff:fe3a:593a dev eth0 lladdr b8:69:f4:3a:59:3a router REACHABLE
Timestamp: Sat Oct  7 17:30:19 2023 240173 us
fe80::ba69:f4ff:fe3a:593a dev eth0 lladdr b8:69:f4:3a:59:3a router STALE
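
For reference, a minimal sketch of how a log like the above can be captured and replayed, assuming the standard iproute2 tools (rtmon and ip) that ship with most distros:

sudo rtmon file iproute.log &     # record netlink route/address/neighbour events to a binary file
RTMON_PID=$!
# ... reproduce the route loss, then stop recording ...
sudo kill "$RTMON_PID"
ip monitor file iproute.log       # replay the recorded events as readable, timestamped text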

Re-adding the routes with:

$ sudo ip route add 172.20.127.0/24 dev eth0
$ sudo ip route add default via 172.20.127.254 dev eth0

Restores connectivity until the next time the routes are purged. This makes me suspect the scripts linked in both issues aren't too helpful, because they seem to rely on shutting down the WSL instance and would therefore need to run for an extended time: the purge doesn't happen predictably, and as with #10587 there seems to be an extended initial period where everything works fine.
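
A rough workaround sketch in the same spirit (not an official fix): a small loop that restores the routes whenever the default IPv4 route disappears. The gateway, subnet, and interface below are taken from this network (172.20.127.254, 172.20.127.0/24, eth0) and would need adjusting; run it as root.

#!/usr/bin/env bash
# Check every 10 seconds; if the default IPv4 route has been purged,
# re-add the on-link and default routes shown above.
GW=172.20.127.254
NET=172.20.127.0/24
DEV=eth0
while sleep 10; do
    if ! ip -4 route show default | grep -q .; then
        echo "$(date): default route missing, re-adding" >&2
        ip route add "$NET" dev "$DEV" 2>/dev/null
        ip route add default via "$GW" dev "$DEV" 2>/dev/null
    fi
done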

Filtering Windows Event Logger (+/- a few mins) showed the following entries happening at the exact time as the route deletions:

Information 7/10/2023 5:28:26 pm    WMI-Activity    5857    None
Information 7/10/2023 5:28:26 pm    Host-Network-Service    1041    None
Information 7/10/2023 5:28:26 pm    Host-Network-Service    1041    None
Information 7/10/2023 5:28:26 pm    SMBServer   1040    None
Information 7/10/2023 5:28:26 pm    SMBClient   30810   None
Information 7/10/2023 5:28:26 pm    SMBServer   1040    None
Information 7/10/2023 5:28:26 pm    Time-Service    266 None
Information 7/10/2023 5:28:26 pm    SMBClient   30810   None

SMB Entries point to:

Added a TCP/IP transport interface.

Name: Ethernet
InterfaceIndex: 0x8

Guidance:
A TCP/IP binding was added to the specified network adapter for the SMB client. The SMB client can now send and receive SMB traffic on this network adapter using TCP/IP. You should expect this event when a computer restarts or when a previously disabled network adaptor is re-enabled. No user action is required.

The Host-Network-Service entries, meanwhile, show changes to IPv6 networking (specifically Unique Local Unicast addresses on the host's main Ethernet connection).

Export of above Event Viewer logs

This makes me think that some service responsible for keeping the WSL instance's routing etc. in sync with the host in mirrored mode forgets to re-add IPv4 routes after a period of time if it detects a change to the Windows 11 host's interfaces - even a minor one. IPv6 routes remain intact, as seen here:

$ ping -c4 2404:6800:4006:812::200e
PING 2404:6800:4006:812::200e(2404:6800:4006:812::200e) 56 data bytes
64 bytes from 2404:6800:4006:812::200e: icmp_seq=1 ttl=116 time=42.2 ms
64 bytes from 2404:6800:4006:812::200e: icmp_seq=2 ttl=116 time=37.3 ms
64 bytes from 2404:6800:4006:812::200e: icmp_seq=3 ttl=116 time=36.7 ms
64 bytes from 2404:6800:4006:812::200e: icmp_seq=4 ttl=116 time=36.9 ms

--- 2404:6800:4006:812::200e ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3005ms
rtt min/avg/max/mdev = 36.699/38.273/42.189/2.271 ms
$ ping -c4 1.1.1.1
ping: connect: Network is unreachable

leoleoasd commented 11 months ago

@keith-horton I've seen a repro and sent you the trace.

keith-horton commented 11 months ago

@snjnz - that's great information. Yes, that shows that something brought down that virtual NIC on the host, which triggered the vmNIC in the container to go down and up again. Linux deletes addresses and routes when its adapters go down (unlike Windows) - and WSL was not aware that it went down behind us.

I'm bringing up a feature to detect when this happens and reset all addresses and routes.

@leoleoasd , I'm looking over your traces now.

snjnz commented 11 months ago

You must be psychic, because I'd just opened the issue to add an additional observation.

Over the last weekend I was ruling out IPv6 (which I'd just enabled on my network) as the cause of some internet stability issues. As part of this I disabled router advertisements on the router for approximately 48 hours. During this period, WSL with networking mode set to mirrored operated as expected. So now I'm wondering if it's specific to IPv6.

Yes, that shows that something brought down that virtual NIC on the host, which triggered the vmNIC in the container to go down and up again. Linux deletes addresses and routes when its adapters go down (unlike Windows) - and WSL was not aware that it went down behind us.

The IP addresses (both v4 and v6) seem to remain up; it's just the routing table that is affected. Additionally, the IPv6 table doesn't always clear (sorry, I don't have copy-and-pastes from ip -6 route handy), leaving IPv6 sometimes operable while IPv4 isn't.
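
A quick way to capture that comparison for next time (a simple sketch, assuming a standard iproute2 install): snapshot the addresses and both route tables once while things work and again after the drop.

ts=$(date +%Y%m%dT%H%M%S)
ip addr show dev eth0 > "addr-$ts.txt"
ip -4 route show      > "route4-$ts.txt"
ip -6 route show      > "route6-$ts.txt"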

From the trace I added above, I can actually see that it seems to be deleting and re-adding the same routes already, so could it be a race condition of some description that leaves the routing table incomplete? For example, these two entries:

Deletion:

Timestamp: Sat Oct  7 17:28:26 2023 200308 us
Deleted 2: eth0    inet6 [prefix]:1dd6:545c:c659:49db/64 scope global nodad deprecated noprefixroute
       valid_lft forever preferred_lft 0sec
Timestamp: Sat Oct  7 17:28:26 2023 200318 us
Deleted local [prefix]:1dd6:545c:c659:49db dev eth0 proto kernel metric 0 pref medium

Readd:

Timestamp: Sat Oct  7 17:28:26 2023 202009 us
2: eth0    inet6 [prefix]:1dd6:545c:c659:49db/64 scope global nodad deprecated noprefixroute
       valid_lft forever preferred_lft 0sec
Timestamp: Sat Oct  7 17:28:26 2023 202017 us
local [prefix]:1dd6:545c:c659:49db dev eth0 proto kernel metric 0 pref medium

Another hunch I was trying to look into, given the seemingly 24-hour duration, was whether Windows' temporary IPv6 addressing might be playing a part, but I've been having trouble disabling that (or at least getting Windows to honour it).

keith-horton commented 11 months ago

@leoleoasd , I can see the network status on the host is changing ... a lot. I don't know the source of what is changing it. I see that we pushed the address and route information successfully. But if you are seeing the routes deleted, then that suggests something also changed the state of that vNIC on the host.

This is more evidence that we need to talk within the team about better detecting changes from within the Linux container and resyncing when we see unexpected changes.

@snjnz - yes, you're correct, only the routes are deleted when the interface bounces down and up again. Though I would need to see a WSL trace to see what changes we are directing from WSL vs. what changes are happening from within Linux (likely responding to the vmNIC state changing).

Thanks!

snjnz commented 11 months ago

I decided to take a look at my hunch regarding temporary IP addresses, and set the Temporary Preferred Lifetime to 15 minutes. The first trigger coincided with the PreferredLifetime on the IP at boot reaching zero.

I've grabbed some traces and will e-mail you the link @keith-horton in the next 10 mins or so.

Once again I'm seeing the same log messages coinciding with the routes dropping in WSL. Within a few seconds I'm also seeing the following in the NetworkProfile log in Event Viewer:

Network State Change Fired
    New Internet Connection Profile: false
    Connection Cost Changed: false
    Domain Connectivity Level Changed: false
    Network Connectivity Level Changed: false
    Host Name Changed: true
    Wwan Registration State Changed: false
    Tethering Operational State Changed: false
    Tethering Client Count Changed: false

It's been forever since I've done anything with Windows networking, so I have no idea where other events are logged (Get-NetIPAddress only provides countdowns and not exact times, for instance, which is unfortunate).

One thing I've also noticed: setting the much shorter temporary address lifetimes results in the routes eventually returning, which I think furthers my thought that it's a race condition with the routes going up and down.

leoleoasd commented 11 months ago

Weird, why is the network status on the host changing? 🤔 This is a PC, and it was never moved or connected to another network. Can you give me some details about the changing network status?

leoleoasd commented 11 months ago

Oh, maybe it's because I'm running Tailscale on my PC, and its connection may be unstable?

sbalmos commented 11 months ago

Coming over from the other mentioned issue, #10587, which I authored: I can also confirm/suspect it's a race condition of some sort, related to IPv6's temporary IP timeouts. Now that I've read this thread's background, my observation was also that the Linux route table was changed (and thus lost the IPv4 default route) at the timestamp of WSL VM Start + Temporary IP Preferred Lifetime + Temporary IP Desync Time - Temporary IP Regenerate Time. For giggles, I changed my Temporary IP Preferred Lifetime to 5 minutes with a Desync Time of 10 seconds. I lost connectivity in WSL at the expected timestamp.
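
As a worked example of that timing (a sketch only; the 5-minute and 10-second values come from the test above, while the 5-second regenerate time is an assumed default - netsh interface ipv6 show privacy reports the actual values):

preferred=$((5 * 60))   # temporary-address preferred lifetime (seconds)
desync=10               # desync time (seconds)
regenerate=5            # regenerate time (seconds, assumed default)
echo "expect route loss ~$((preferred + desync - regenerate))s after WSL start"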

For the "routes returning eventually", I assume you're referring to the IPv4 routes, I wonder if that's around some time related to your IPv4 DHCP lease renewal?

I'm going to try running IPv6 with Temporary IPs completely disabled and see what happens for a while.

snjnz commented 11 months ago

I can also confirm/suspect it's a race condition of some sort, related to IPv6's temporary IP timeouts.

Fingers crossed this makes it easier to debug on Microsoft's side. I'm glad we've managed to come up with a way to speed up the repro for the devs.

For the "routes returning eventually", I assume you're referring to the IPv4 routes, I wonder if that's around some time related to your IPv4 DHCP lease renewal?

What I'd noticed is that while I was writing my update here and sending Keith the traces, IPv4 pings I had running in the background started working again (it seemed to coincide with the next temporary address renewal, which points towards the routes being restored), but they dropped again a little later and didn't come back. So it's not conclusive, but there is an aspect of 'they can come back'.

keith-horton commented 10 months ago

Thank you all for your help debugging this. I was able to reproduce this and I have a fix which will hopefully be out with the next update.

keith-horton commented 9 months ago

The preview release should have the fix for this, which hopefully will be going to the public release soon. Thanks again!

leoleoasd commented 9 months ago

How can we install the preview release? Is it a public preview?

keith-horton commented 9 months ago

wsl --update --pre-release

This should update to the latest pre-release.
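
To confirm which build you ended up on afterwards, wsl --version reports the installed WSL package version (both are standard wsl.exe flags):

wsl --update --pre-release
wsl --version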

snjnz commented 9 months ago

The preview release should have the fix for this, which hopefully will be going to the public release soon. Thanks again!

Haven't had any problems with the last couple of pre-releases. Thank you for your work on this!

suxmao commented 8 months ago

Is this problem fixed?

leoleoasd commented 8 months ago

I'm still encountering this problem after upgrading to the latest preview. I'm using Tailscale, so that may be the issue.