microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

Properly mirror all packets (including kernel sockets) for mirrored network mode #10842

Open wizpresso-steve-cy-fan opened 10 months ago

wizpresso-steve-cy-fan commented 10 months ago

Is your feature request related to a problem? Please describe.

I have been investigating this problem for a few weeks now, and I realized that all of the problems I am seeing share the following common symptoms:

  1. No bidirectional traffic
  2. The respective packets can be intercepted in Wireshark on both sides*, but they are never delivered to the WSL kernel.

*: Given two machines A and B, if the WSL of A sends a packet to machine B, the packet is delivered to B and captured by Wireshark on the Windows side of B, but tcpdump inside the WSL of B captures nothing.
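
For reference, this is roughly how the asymmetry shows up (interface name and port are just example placeholders): the same traffic that Wireshark sees on the Windows host of B never appears to tcpdump inside B's WSL.

# Inside the WSL distro of machine B: watch the mirrored interface for the
# expected traffic (interface name and port are placeholders).
sudo tcpdump -ni eth0 udp port 51820
# Wireshark on the Windows host of B shows the packets arriving,
# yet the tcpdump above stays silent.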

In addition, some examples are (without loss of generality; see the sketch after this list for one concrete case):

  1. IPVS creates a virtual interface and connects to the backend servers from the kernel itself. This surprisingly does not work.
  2. The Wireguard kernel module does much the same: it creates a UDP socket in the kernel and connects to the peer over UDP from within the kernel. This does not work either.
  3. Any eBPF program also doesn't work
  4. IP-over-UDP does not work
  5. GRE does not work
  6. VXLAN does not work
  7. IPIP does not work
  8. L2TPv3 was not tested, but given the result with VXLAN, I assume it won't work either.
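
As a concrete illustration of what "kernel-initiated" means here (VNI, addresses, and interface names are made-up placeholders), creating a VXLAN device makes the kernel itself open and own the UDP socket on the dstport; no userspace process ever binds it:

# The kernel opens the UDP socket on dstport 4789 for this device itself;
# `ss -ulnp` will list the port with no owning process.
sudo ip link add vxlan0 type vxlan id 42 dev eth0 dstport 4789 remote 192.0.2.20
sudo ip addr add 10.200.0.1/24 dev vxlan0
sudo ip link set vxlan0 up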

Thus, with all the aforementioned clues, my educated guess is this: all of the above are managed by the Linux kernel rather than by userspace, and since every userspace program works, anything originating from the kernel is clearly missing the mirroring somehow.
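
A quick way to tell the two classes apart (ports are examples): a socket opened by a userspace program shows its owning process in ss, while a kernel-owned socket (Wireguard's ListenPort, VXLAN's dstport, ...) is listed with no process at all, and per this theory it is exactly those sockets that never get registered for mirroring.

# Userspace socket: mirrored fine, and ss shows the owning process.
nc -u -l -p 5000 &
ss -ulnp | grep :5000     # -> ... users:(("nc",pid=...,fd=...))
# Kernel socket, e.g. Wireguard's UDP listen port (if one is configured):
# listed without any users:(...) field at all.
ss -ulnp | grep :51820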

This theory is supported by this Wireguard workaround: https://github.com/microsoft/WSL/issues/10841#issuecomment-1831230148

I noticed something interesting in the following nftables rules (WLOG again):

table ip nat {
        chain WSLPOSTROUTING {
                type nat hook postrouting priority srcnat - 1; policy accept;
                oif "eth0" udp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth0" tcp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth1" udp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth1" tcp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth2" udp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth2" tcp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth3" udp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth3" tcp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth4" udp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth4" tcp sport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900
                oif "eth4" udp dport 1-65535 meta mark != 0x00000001 counter packets 0 bytes 0 masquerade to :59600-59900

        }

}

You can't reproduce those rules with normal nft commands, so I guess this is how mirrored mode is implemented at the moment (sadly, the technical details are not transparent): a combination of userspace ptrace (to rewrite socket options and enable mirrored network support transparently), nftables, and Hyper-V networking hacks that zero-copy network packets from the loopback device and then masquerade the traffic to the main interface, back and forth.
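
For anyone who wants to check what their own distro got, the chain can be dumped from inside WSL using the table and chain names shown above:

# Dump the auto-generated chain that mirrored mode installs (run as root inside WSL).
nft list chain ip nat WSLPOSTROUTING
# Or dump everything, including any other chains WSL may have added.
nft list ruleset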

If my theory is correct, any kernel-initiated socket will end up stuck forever because it somehow bypassed the ptrace step or disregarded the nft rules. The packets were delivered, but they were missing some important piece of information, so the Hyper-V networking side never realized the port was registered for mirroring, in either the incoming or the outgoing direction (and, sadly, this is not a true zero-copy packet mirror either, because the masquerade relies on stateful conntrack). In the end the packet is discarded on the Windows side for lack of a receiver, and eventually times out.
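
If that is what happens, the masquerade should leave conntrack entries behind for working userspace flows (with a translated source port in the 59600-59900 range) and nothing, or only unreplied entries, for kernel-originated ones. The conntrack tool can be used to check this; the port below is just the Wireguard example again:

# List NAT-ed UDP flows towards the Wireguard port; a healthy flow shows a
# rewritten source port, a kernel-originated one is missing or stays [UNREPLIED].
sudo conntrack -L -p udp --dport 51820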

To further prove this theory, I think IPsec would work if we let userspace Strongswan handle the network connection entirely and only offloaded the TLS stuff to the kernel (I'm not sure if we can use kTLS that way). Another good proof would be testing whether KSMBD (kernel-level Samba) works.
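
KSMBD would be a clean test case: once the ksmbd module is loaded and its userspace helper is running, port 445 is served by a kernel socket, so per the theory an external client should see the same one-way behaviour. A rough sketch of the test (share/user configuration omitted; I have not actually run this):

# In WSL: load the in-kernel SMB server; the TCP listener on 445 lives in the
# kernel, while ksmbd.mountd is only a userspace helper.
sudo modprobe ksmbd
sudo ksmbd.mountd
# From another machine on the LAN (address is a placeholder): per the theory
# this should fail under mirrored mode even though the port is open inside WSL.
smbclient -L //192.168.1.50 -N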

Related items:

#10841

#10840

https://github.com/microsoft/WSL/discussions/10730

Describe the solution you'd like

See if this theory is correct, then check if anything necessary is missing for such scenarios, and try to implement it...

Describe alternatives you've considered

Swallow it and accept that mirrored network mode is for userspace only.

Additional context

Unsurprisingly, this is a missing feature rather than a bug, because the people at MSFT clearly did not expect this kind of rare application.

For context, we are running k0s with kube-proxy in IPVS mode and VXLAN for Calico. We tried switching Calico between IPIP, VXLAN, Wireguard, and raw routing mode (that is, ip route ... via <local mirrored interface IP address>, since all of our WSL machines are on the same L2 network); none of that worked.
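
For completeness, the "raw routing" attempt amounted to static routes like the following on each node (pod CIDR and addresses are made-up placeholders), which is only possible because all nodes sit on the same L2 segment:

# On node A: route node B's pod CIDR directly via node B's mirrored interface IP.
sudo ip route add 10.244.2.0/24 via 192.168.1.22 dev eth0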

With a custom kernel and kube-proxy switched to iptables mode it somehow worked, but pod-to-pod communication is still unreachable and the conntrack table fills up with 0-length packets stuck in SYN_WAIT.

Rant time:

I'm well aware that WSL is intended as a single-user, semi-ephemeral developer environment, and not for running critical server applications like K8s.

However, we have a huge sunk cost in Windows for our workstations, so we cannot afford to switch to Linux; and, to add insult to injury, we need a lot of GPU resources for machine learning, and that tooling is missing on the Windows side. We wanted to combine the best of both worlds by using Kubeflow to manage idling GPU resources efficiently.

And I'm sure any MSFT consultant would suggest using an actual Hyper-V VM for this purpose, but as a small startup we unfortunately cannot afford DDA. So until MSFT officially supports GPU-P on Hyper-V VMs (heck, even Windows GPU-P support is hidden behind secret PowerShell commands and options), our best bet is WSL2 and its official CUDA GPU support... with a bit of hacking on gpu-operator from me as well.

We do have an ultimate workaround for all of this: network bridge mode. It solves everything mentioned above, except that it was recently "deprecated" in favor of mirrored network mode...
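
For reference, this is roughly how we enable it today (a sketch of the relevant .wslconfig section; networkingMode=bridged is the deprecated/experimental setting, and the vmSwitch value must match an external Hyper-V virtual switch you created yourself, so the name below is just an example):

# %UserProfile%\.wslconfig
[wsl2]
networkingMode=bridged
vmSwitch=ExternalSwitch
dhcp=true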

chanpreetdhanjal commented 10 months ago

Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

wizpresso-steve-cy-fan commented 10 months ago

@chanpreetdhanjal I'd bet 99% that this problem won't show up in the diagnostic logs, but I will try.

shixudong2020 commented 6 months ago

Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

It is true that kernel sockets cannot be mirrored. Temporary workaround (also spelled out as a snippet below):

  1. For Wireguard with ListenPort = 51820: inside WSL, run "iperf -u -s -p 51820". It fails with "Address already in use", but Wireguard can then receive UDP.
  2. For VXLAN with a group and dstport 4789: inside WSL, run "iperf -u -s -p 4789". It fails with "Address already in use", but VXLAN then works normally.
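
A small sketch of the same trick as a reusable shell snippet (assuming iperf is installed; the failed bind is the whole point, and timeout only guards the rare case where the bind actually succeeds and iperf keeps running):

# Poke a kernel-owned UDP port so the mirroring plumbing registers it.
# "Address already in use" is expected here.
poke_udp_port() { timeout 2 iperf -u -s -p "$1" || true; }
poke_udp_port 51820   # Wireguard ListenPort
poke_udp_port 4789    # VXLAN dstport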

feng-yifan commented 3 months ago

We have many Windows PCs with k8s installed in WSL2; these nodes join the control plane as tainted worker nodes so that everyone can use some heavy infrastructure on the server. For now, I think using the bridged network is the best practice.