Link between Kubernetes hosts may not work due to NAT on the k8s cluster

Hi,

Testing Mininet-Sec on a certain Kubernetes cluster highlighted a corner case for VXLAN links: the link does not work as expected because SNAT is applied on the network between pods.

Basically, if you sniff the traffic between the two K8s hosts, you will notice the following behavior:

Host 1:

# ip addr show dev eth0
3: eth0@if95: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 86:xx:xx:xx:xx:df brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.xx.xx.173/32 scope global eth0
       valid_lft forever preferred_lft forever
# ip -d link show type vxlan
386: s3-eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 8e:ae:f5:8c:01:f8 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
    vxlan id 3 remote 10.xx.xx.162 srcport 0 0 dstport 8472 ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx
    openvswitch_slave addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Host 3:

# ip -d link show type vxlan
4: h3-eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether da:23:fa:6a:69:4c brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    vxlan id 3 remote 10.xx.xx.173 srcport 0 0 dstport 8472 ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
# ip addr show dev eth0
3: eth0@if113: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 2a:xx:xx:xx:xx:e0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.xx.xx.162/32 scope global eth0
       valid_lft forever preferred_lft forever

As you can see above, the VXLAN tunnel is configured between 10.xx.xx.162 <-> 10.xx.xx.173. However, if you run a ping over the VXLAN interface and leave tcpdump running on the other side (here, tcpdump is running on Host 3), this is what shows up:

# tcpdump -i eth0 -n -e
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:09:17.455203 ee:ee:ee:ee:ee:ee > 2a:xx:xx:xx:xx:e0, ethertype IPv4 (0x0800), length 92: 192.168.xx.3.57234 > 10.xx.xx.162.8472: OTV, flags [I] (0x08), overlay 0, instance 1
a2:3c:ee:1c:26:8d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.0.0.3 tell 10.0.0.1, length 28
09:09:17.455236 2a:xx:xx:xx:xx:e0 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 92: 10.xx.xx.162.57234 > 192.168.xx.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1
da:23:fa:6a:69:4c > a2:3c:ee:1c:26:8d, ethertype ARP (0x0806), length 42: Reply 10.0.0.3 is-at da:23:fa:6a:69:4c, length 28
09:09:17.455319 ee:ee:ee:ee:ee:ee > 2a:xx:xx:xx:xx:e0, ethertype IPv4 (0x0800), length 120: 192.168.xx.3 > 10.xx.xx.162: ICMP 192.168.xx.3 udp port 8472 unreachable, length 86

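As the capture shows, Host 3 sends its VXLAN reply to the source IP address seen on the request packet (192.168.xx.3) instead of the remote address actually configured on the tunnel (10.xx.xx.173), and that reply is rejected with an ICMP port unreachable. This is consistent with VXLAN source learning, which is enabled by default on Linux VXLAN devices: the reply presumably follows the address learned from the incoming packet rather than the configured remote.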
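To spot this mismatch without reading the capture by eye, the comparison can be scripted. The following is only a minimal sketch, assuming tcpdump output in the `-n -e` format shown above. The function name `vxlan_src_mismatch` and its arguments are hypothetical and not part of Mininet-Sec.

```shell
#!/bin/sh
# vxlan_src_mismatch REMOTE LOCAL [PORT]   (hypothetical helper)
# Reads `tcpdump -n -e` lines on stdin and prints, once each, every outer
# source IP that sent traffic to LOCAL:PORT but is not the configured REMOTE.
vxlan_src_mismatch() {
    awk -v raddr="$1" -v laddr="$2" -v port="${3:-8472}" '
        {
            want = laddr "." port ":"          # e.g. "10.0.0.162.8472:"
            for (i = 2; i < NF; i++) {
                if ($i == ">" && $(i + 1) == want) {
                    src = $(i - 1)
                    sub(/\.[0-9]+$/, "", src)  # drop the UDP source port
                    if (src != raddr && !(src in seen)) {
                        seen[src] = 1
                        print src
                    }
                }
            }
        }'
}
```

On Host 3 this could be fed from a live capture, for example `tcpdump -l -i eth0 -n -e udp port 8472 | vxlan_src_mismatch 10.xx.xx.173 10.xx.xx.162`. Any address it prints is an outer source that differs from the configured remote, which is the symptom described here.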

Workaround: you can force the source IP address to be rewritten with netfilter NAT (SNAT) rules, but that is definitely not a good approach.

Although this looks like a misconfiguration of the Kubernetes cluster (especially since the nodes already have a valid routable addressing scheme between them, so SNAT should not be needed for traffic between them), Mininet-Sec should be robust enough to avoid this behavior.
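One direction that might make the link robust is to create the VXLAN device with source learning disabled, so that traffic always goes to the configured remote rather than to an address learned from incoming packets. This is an untested sketch, not how Mininet-Sec creates links today. It assumes the standard iproute2 `nolearning` option, and the helper name `vxlan_add_cmd` is hypothetical. It only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# vxlan_add_cmd NAME VNI REMOTE [PORT]   (hypothetical helper, dry run only)
# Prints an `ip link add` command for a point-to-point VXLAN device with
# source learning turned off; nothing is executed.
vxlan_add_cmd() {
    printf 'ip link add %s type vxlan id %s remote %s dstport %s nolearning\n' \
        "$1" "$2" "$3" "${4:-8472}"
}

# Example with the values from this report (addresses masked as above):
vxlan_add_cmd h3-eth0 3 10.xx.xx.173
```

Whether this fully avoids the problem on the affected cluster is an assumption. It relies on the remote pod address being reachable directly, which the request packets arriving at 10.xx.xx.162 suggest.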

(mininet-sec/mininet-sec, issue #18)