moby / libnetwork

networking for containers
Apache License 2.0

Overlay network broken when outside network mtu is smaller than default (1450) #2661

Closed: Eximius closed this 1 month ago

Eximius commented 2 years ago

Symptoms

I could not figure out why the overlay network simply did not work. There was some connectivity, but it would transiently break down entirely.

Related Issues

As far as I understand, these issues are likely the same, or at least closely related:

- https://github.com/moby/moby/issues/41775
- https://github.com/moby/moby/issues/43551
- https://github.com/moby/moby/issues/43359
- https://github.com/moby/moby/issues/16841
- https://stackoverflow.com/questions/52409012/docker-swarm-mode-routing-mesh-not-working-with-wireguard-vpn
- https://www.reddit.com/r/docker/comments/u7tq2e/issue_with_docker_overlay_network_not_connecting/

Setup

Two machines, A and B, connected over a WireGuard link with an MTU of 1350. WireGuard is correctly configured and works perfectly for all other traffic.

A is the swarm manager and B is a worker.

The only thing deployed is a basic swarmpit setup https://dockerswarm.rocks/swarmpit/.

First clue: swarmpit was unable to receive statistics from B on A. The HTTP POST would time out very slowly (~10 min).

Actual Problem

Clearly the overlay routing mesh does not work. A simple test, which I used, is curl localhost:<port of swarmpit> on machine B.

According to https://github.com/moby/libnetwork/blob/master/docs/images/network_flow_overlay.png, and confirmed empirically, the connection is routed:

  1. B - docker_gwbridge
  2. B - netns ingress_sbox
  3. B - wg0
  4. A - wg0
  5. A - netns <random hash for the routing netns>
  6. A - netns <container>

Since it was HTTP, the request was 80 bytes and the reply was 3500 bytes, so the reply had to be fragmented. A tcpdump capture on eth0 (10.0.0.0/24) inside <random hash for the routing netns> revealed:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
19:17:28.188672 IP (tos 0x0, ttl 63, id 5141, offset 0, flags [DF], proto TCP (6), length 52)
    10.0.0.21.57910 > 10.0.0.13.8090: Flags [F.], cksum 0x0029 (correct), seq 616815329, ack 4072248962, win 128, options [nop,nop,TS val 303267193 ecr 28263940], length 0
19:17:28.189099 IP (tos 0x0, ttl 64, id 14358, offset 0, flags [DF], proto TCP (6), length 52)
    10.0.0.13.8090 > 10.0.0.21.57910: Flags [F.], cksum 0x8d38 (correct), seq 2721, ack 1, win 502, options [nop,nop,TS val 28290269 ecr 303267193], length 0
19:17:28.255852 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.0.0.21.57910 > 10.0.0.13.8090: Flags [R], cksum 0x9a48 (correct), seq 616815330, win 0, length 0
19:17:29.130169 IP (tos 0x0, ttl 63, id 48126, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.0.21.58002 > 10.0.0.13.8090: Flags [S], cksum 0x991e (correct), seq 3194242802, win 65535, options [mss 1460,sackOK,TS val 303268135 ecr 0,nop,wscale 9], length 0
19:17:29.130201 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [S.], cksum 0x1382 (correct), seq 1425507161, ack 3194242803, win 64308, options [mss 1410,sackOK,TS val 28291210 ecr 303268135,nop,wscale 7], length 0
19:17:29.195983 IP (tos 0x0, ttl 63, id 48127, offset 0, flags [DF], proto TCP (6), length 52)
    10.0.0.21.58002 > 10.0.0.13.8090: Flags [.], cksum 0x3c8f (correct), seq 1, ack 1, win 128, options [nop,nop,TS val 303268201 ecr 28291210], length 0
19:17:29.196012 IP (tos 0x0, ttl 63, id 48128, offset 0, flags [DF], proto TCP (6), length 131)
    10.0.0.21.58002 > 10.0.0.13.8090: Flags [P.], cksum 0x78e1 (correct), seq 1:80, ack 1, win 128, options [nop,nop,TS val 303268201 ecr 28291210], length 79
19:17:29.196022 IP (tos 0x0, ttl 64, id 10157, offset 0, flags [DF], proto TCP (6), length 52)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], cksum 0x3a88 (correct), seq 1, ack 80, win 502, options [nop,nop,TS val 28291276 ecr 303268201], length 0
19:17:29.203989 IP (tos 0x0, ttl 64, id 10158, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], cksum 0x9ccb (correct), seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28291284 ecr 303268201], length 1398
19:17:29.203992 IP (tos 0x0, ttl 64, id 10159, offset 0, flags [DF], proto TCP (6), length 1374)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [P.], cksum 0xba84 (correct), seq 1399:2721, ack 80, win 502, options [nop,nop,TS val 28291284 ecr 303268201], length 1322
19:17:29.204011 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.0.0.21 > 10.0.0.13: ICMP 10.0.0.21 unreachable - need to frag (mtu 1300), length 556 (wrong icmp cksum da41 (->dbf4)!)
        IP (tos 0x0, ttl 64, id 10158, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28291284 ecr 303268201], length 1398
19:17:29.204017 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.0.0.21 > 10.0.0.13: ICMP 10.0.0.21 unreachable - need to frag (mtu 1300), length 556 (wrong icmp cksum 4105 (->3921)!)
        IP (tos 0x0, ttl 64, id 10159, offset 0, flags [DF], proto TCP (6), length 1374)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [P.], seq 1399:2721, ack 80, win 502, options [nop,nop,TS val 28291284 ecr 303268201], length 1322
19:17:29.346302 IP (tos 0x0, ttl 64, id 10160, offset 0, flags [DF], proto TCP (6), length 1374)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [P.], cksum 0xb9f6 (correct), seq 1399:2721, ack 80, win 502, options [nop,nop,TS val 28291426 ecr 303268201], length 1322
19:17:29.346322 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.0.0.21 > 10.0.0.13: ICMP 10.0.0.21 unreachable - need to frag (mtu 1300), length 556 (wrong icmp cksum 3f05 (->3921)!)
        IP (tos 0x0, ttl 64, id 10160, offset 0, flags [DF], proto TCP (6), length 1374)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [P.], seq 1399:2721, ack 80, win 502, options [nop,nop,TS val 28291426 ecr 303268201], length 1322
19:17:29.636280 IP (tos 0x0, ttl 64, id 10161, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], cksum 0x9b1b (correct), seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28291716 ecr 303268201], length 1398
19:17:29.636351 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.0.0.21 > 10.0.0.13: ICMP 10.0.0.21 unreachable - need to frag (mtu 1300), length 556 (wrong icmp cksum d841 (->dbf4)!)
        IP (tos 0x0, ttl 64, id 10161, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28291716 ecr 303268201], length 1398
19:17:30.196303 IP (tos 0x0, ttl 64, id 10162, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], cksum 0x98eb (correct), seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28292276 ecr 303268201], length 1398
19:17:30.196333 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.0.0.21 > 10.0.0.13: ICMP 10.0.0.21 unreachable - need to frag (mtu 1300), length 556 (wrong icmp cksum d841 (->dbf4)!)
        IP (tos 0x0, ttl 64, id 10162, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28292276 ecr 303268201], length 1398
19:17:31.289626 IP (tos 0x0, ttl 64, id 10163, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], cksum 0x94a5 (correct), seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28293370 ecr 303268201], length 1398
19:17:31.289657 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 576)
    10.0.0.21 > 10.0.0.13: ICMP 10.0.0.21 unreachable - need to frag (mtu 1300), length 556 (wrong icmp cksum d841 (->dbf4)!)
        IP (tos 0x0, ttl 64, id 10163, offset 0, flags [DF], proto TCP (6), length 1450)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [.], seq 1:1399, ack 80, win 502, options [nop,nop,TS val 28293370 ecr 303268201], length 1398
19:17:32.078835 IP (tos 0x0, ttl 63, id 48129, offset 0, flags [DF], proto TCP (6), length 52)
    10.0.0.21.58002 > 10.0.0.13.8090: Flags [F.], cksum 0x30ba (correct), seq 80, ack 1, win 128, options [nop,nop,TS val 303271084 ecr 28291276], length 0
19:17:32.079120 IP (tos 0x0, ttl 64, id 10164, offset 0, flags [DF], proto TCP (6), length 52)
    10.0.0.13.8090 > 10.0.0.21.58002: Flags [F.], cksum 0x1960 (correct), seq 2721, ack 81, win 502, options [nop,nop,TS val 28294159 ecr 303271084], length 0
19:17:32.149975 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    10.0.0.21.58002 > 10.0.0.13.8090: Flags [R], cksum 0x93ea (correct), seq 3194242883, win 0, length 0

As I understand it, the failed checksum causes the ICMP packet that advertises the MTU to fragment at to be dropped [correct me if I am wrong], which then causes the TCP segments to be retransmitted endlessly until the connection fails.
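As a side note, path MTU discovery can be probed by hand with ping. A sketch, using the peer address from the capture above as an assumption about the environment (the DF flag makes oversized probes fail instead of fragmenting):

```shell
#!/bin/sh
# Probe whether a candidate MTU fits the path. With DF set ("-M do"),
# ping reports "Message too long" once the packet exceeds the path MTU.
# ICMP payload size = candidate MTU - 20 (IP header) - 8 (ICMP header).
candidate_mtu=1300
payload=$((candidate_mtu - 28))
echo "probing with a ${payload}-byte payload"   # 1272 bytes for MTU 1300
# Peer address taken from the capture above; adjust for your setup:
# ping -M do -c 3 -s "$payload" 10.0.0.13
```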

However, it is unclear where this wrong checksum comes from, since in <random hash for the routing netns>:

ethtool -k eth0
Features for eth0:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: off
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off

And host machine:

$ ethtool -k wg0
Features for wg0:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: off
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]

I set offloading to off (ethtool -K <iface> rx off tx off) because some people had issues with hardware checksums failing under WireGuard (old thread): https://lists.zx2c4.com/pipermail/wireguard/2019-December/004801.html, and others had success fixing a similar overlay issue this way: https://github.com/moby/moby/issues/41775 https://stackoverflow.com/questions/66866779/routing-mesh-stop-working-in-docker-swarm

As I understand it, the packet goes into the VXLAN tunnel, which then receives an ICMP reply reporting the MTU failure.

Setting the MTU of eth0 in <random hash for the routing netns> equal to the wg0 MTU fixes the problem entirely. This, of course, needs to be done on every machine, for every network that goes over the VXLAN.

Notes

The way I found <random hash for the routing netns> was by running for f in /run/docker/netns/*; do echo $f:; nsenter --net=$f ip a; done and looking for the IP range of the target service (10.0.0.0/24 in my case).
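Combining that discovery loop with the MTU fix, a per-host helper might look roughly like this. A sketch only: the subnet and MTU values are from my setup and are assumptions for yours, and it requires root.

```shell
#!/bin/sh
# Sketch: clamp eth0's MTU inside every Docker netns whose addresses match
# the affected overlay subnet. Netns names under /run/docker/netns vary
# per host, so we match on the subnet instead.
clamp_overlay_mtu() {
    target_subnet="$1"   # e.g. "10.0.0." -- the affected overlay subnet
    new_mtu="$2"         # e.g. 1350 -- the underlay (wg0) MTU
    for f in /run/docker/netns/*; do
        [ -e "$f" ] || continue   # glob did not match: no netns present
        if nsenter --net="$f" ip -o addr show | grep -q "inet ${target_subnet}"; then
            echo "clamping eth0 MTU in $f to ${new_mtu}"
            nsenter --net="$f" ip link set dev eth0 mtu "${new_mtu}"
        fi
    done
    return 0
}

# Example (run on every node): clamp_overlay_mtu "10.0.0." 1350
```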

A /etc/docker/daemon.json solution of

{
  "mtu": 1350
}

seems to have no effect on swarm.

The same problem seems to have affected docker-compose a long time ago: https://mlohr.com/docker-mtu/

General inquiry

It would be nice if Docker / Docker Swarm worked with different MTUs out of the box (for example by checking the outbound network interfaces and using max(config value, min(interfaces' MTUs))), or at least for the config value to take effect. There are too many other open issues like this one, and it is definitely non-trivial to figure out (if you don't know what you're looking for).
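The suggested rule could be sketched as a tiny helper. This is not an existing Docker option, just an illustration; I clamp the configured value to the smallest interface MTU (taking the smaller of the two, since the overlay can never exceed the underlay) and then subtract the 50-byte VXLAN overhead:

```shell
#!/bin/sh
# Sketch of an auto-MTU rule for overlay networks (not a real Docker
# feature). The effective overlay MTU is the configured value, clamped
# to the smallest outbound interface MTU, minus the 50-byte VXLAN header.
effective_overlay_mtu() {
    configured="$1"; shift   # value from daemon.json
    smallest="$1"; shift     # fold min over the interface MTUs
    for mtu in "$@"; do
        [ "$mtu" -lt "$smallest" ] && smallest="$mtu"
    done
    [ "$configured" -lt "$smallest" ] && smallest="$configured"
    echo $((smallest - 50))  # leave room for the VXLAN encapsulation
}

effective_overlay_mtu 1500 1500 1350   # wg0 at 1350 wins -> prints 1300
```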

I really feel this should just work out of the box (just like all other internet infrastructure).

olljanat commented 2 years ago

The code from here has mostly been moved to moby/moby (see #2665), so that would probably be a better place to report this as well.

However, when using a non-default MTU it is critical to make sure that it is used on all interfaces. I don't have WireGuard in my lab, so I was not able to reproduce the issue, but I did manage to get a smaller MTU onto all interfaces, so I'm documenting it here (new environment setup).

> A /etc/docker/daemon.json solution of
>
> {
>   "mtu": 1350
> }
>
> seems to have no effect on swarm.

This affects only the default bridge and the containers using it.

First of all, you need to modify docker_gwbridge on all nodes by running:

docker network rm docker_gwbridge
docker network create --driver bridge --opt com.docker.network.driver.mtu=1350 docker_gwbridge

Then initialize the swarm and recreate the ingress network:

docker network rm ingress
docker network create --driver overlay --opt com.docker.network.driver.mtu=1350 ingress

Finally, create a test network and a test service on it, which starts one container on every node:

docker network create --driver overlay --opt com.docker.network.driver.mtu=1350 test
docker service create --name test --network test --mode global bash sleep infinity

Now you can start a shell inside one of those containers and verify that the MTU values are correct:

$ docker exec -it db06ebe1eb83 bash
bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0A:00:02:05
          inet addr:10.0.2.5  Bcast:10.0.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1300  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:54 (54.0 B)  TX bytes:0 (0.0 B)

eth1      Link encap:Ethernet  HWaddr 02:42:AC:13:00:02
          inet addr:172.19.0.2  Bcast:172.19.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1350  Metric:1
          RX packets:10 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:876 (876.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

You can also see that the eth0 MTU is actually 1300 instead of 1350. That is because the VXLAN encapsulation adds 50 bytes of overhead.
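For reference, the 50-byte figure is the VXLAN encapsulation overhead with an IPv4 outer header (per RFC 7348), which is why an overlay on a 1350-byte underlay ends up at 1300:

```shell
#!/bin/sh
# VXLAN overhead breakdown (IPv4 outer header, per RFC 7348):
#   20 (outer IPv4) + 8 (UDP) + 8 (VXLAN) + 14 (inner Ethernet) = 50 bytes
overhead=$((20 + 8 + 8 + 14))
echo "overlay MTU for a 1350 underlay = $((1350 - overhead))"
# prints "overlay MTU for a 1350 underlay = 1300"
```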

> I really feel this should just work out of the box (just like all other internet infrastructure).

Agreed. However, I would like to understand how you actually configure WireGuard, to work out what the best way to handle this would be.

EDIT: There is already a PR that would make the MTU setting in the daemon config effective for all networks: https://github.com/moby/moby/pull/43197

akerouanton commented 1 month ago

As noted above by @olljanat, the code in this repo has been moved to https://github.com/moby/moby/tree/master/libnetwork. This repo is now effectively defunct and is not actively watched by maintainers.

If you think you're hitting the same issue and @olljanat's comment above doesn't help, please file an issue on https://github.com/moby/moby.