Closed Eximius closed 1 month ago
The code from this repo has mostly moved to moby/moby (see #2665), which would probably be a better place to report this as well.
However, when using a non-default MTU it is critical to make sure that it gets used on all interfaces. I don't have WireGuard in my lab, so I was not able to reproduce the issue. I did manage to apply a smaller MTU to all interfaces, so I am documenting that here (new environment setup).
A /etc/docker/daemon.json solution of { "mtu": 1350 } seems to have no effect on swarm.
That setting affects only the default bridge and the containers that use it.
First of all, you need to modify docker_gwbridge on all nodes by running these commands:
docker network rm docker_gwbridge
docker network create --driver bridge --opt com.docker.network.driver.mtu=1350 docker_gwbridge
Then you initialize the swarm and modify the ingress network:
docker network rm ingress
docker network create --driver overlay --opt com.docker.network.driver.mtu=1350 ingress
Finally, you create a test network and a test service on it, which starts one container on every node:
docker network create --driver overlay --opt com.docker.network.driver.mtu=1350 test
docker service create --name test --network test --mode global bash sleep infinity
Now you can start a shell inside one of those containers and verify that the MTU values are correct:
$ docker exec -it db06ebe1eb83 bash
bash-5.1# ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:0A:00:02:05
inet addr:10.0.2.5 Bcast:10.0.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1300 Metric:1
RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:54 (54.0 B) TX bytes:0 (0.0 B)
eth1 Link encap:Ethernet HWaddr 02:42:AC:13:00:02
inet addr:172.19.0.2 Bcast:172.19.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1350 Metric:1
RX packets:10 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:876 (876.0 B) TX bytes:0 (0.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
You can also see that the eth0 MTU value is actually 1300 instead of 1350. That is because the VXLAN encapsulation overhead is 50 bytes.
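The arithmetic behind that observation can be sketched as below. This is only an illustration: the 50-byte figure assumes VXLAN over IPv4 with no outer VLAN tag, and the function name `vxlan_inner_mtu` is made up for this example, not part of Docker.

```shell
#!/bin/sh
# Bytes added by VXLAN encapsulation over IPv4 (assumption: no outer VLAN tag):
# outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN header (8) = 50.
VXLAN_OVERHEAD=50

# vxlan_inner_mtu <underlay_mtu>   (hypothetical helper)
# Prints the largest MTU an interface inside the overlay can use
# without its encapsulated packets exceeding the underlay MTU.
vxlan_inner_mtu() {
    echo $(( $1 - VXLAN_OVERHEAD ))
}

vxlan_inner_mtu 1350   # prints 1300
```

By the same arithmetic, an overlay that has to cross a link with an MTU of 1350 would leave at most 1300 bytes for the container-side interface.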
I really feel this should just work out of the box (just like all other internet infrastructure).
Agreed. However, I would like to understand how you actually configure WireGuard, to work out what the best way to handle this would be.
EDIT: There already appears to be a PR that would make the mtu setting in the daemon config effective for all networks: https://github.com/moby/moby/pull/43197
As noted above by @olljanat, the code in this repo has been moved to https://github.com/moby/moby/tree/master/libnetwork. This repo is now effectively defunct, and not actively watched by maintainers.
If you think you're hitting the same issue and @olljanat's comment above doesn't help, please file an issue on https://github.com/moby/moby.
Symptoms
Could not figure out why overlay network just did not work. There was some connectivity, but transiently nothing worked.
Related Issues
As far as I understand, these issues are likely the same, or at least related:
https://github.com/moby/moby/issues/41775
https://github.com/moby/moby/issues/43551
https://github.com/moby/moby/issues/43359
https://github.com/moby/moby/issues/16841
https://stackoverflow.com/questions/52409012/docker-swarm-mode-routing-mesh-not-working-with-wireguard-vpn
https://www.reddit.com/r/docker/comments/u7tq2e/issue_with_docker_overlay_network_not_connecting/
Setup
Two machines A and B connected over a wireguard link with mtu of 1350. Wireguard is proven to be correctly configured and works perfectly for any other access.
A is swarm manager and B is worker.
The only thing deployed is a basic swarmpit setup https://dockerswarm.rocks/swarmpit/.
First clue: swarmpit was unable to receive statistics from B on A. The HTTP POST would time out very slowly (~10 min).
Actual Problem
Clearly the overlay routing mesh does not work. A simple test, which I used, is to run
curl localhost:<port of swarmpit>
on machine B.
According to https://github.com/moby/libnetwork/blob/master/docs/images/network_flow_overlay.png and empirically, the connection will route:
<random hash for the routing netns>
<container>
Since it was HTTP, the request was 80 bytes and the reply was 3500 bytes. The reply had to be fragmented. A tcpdump capture of eth0 (10.0.0.0/24) in <random hash for the routing netns> revealed:
Now, as I understand it, the checksum failure will cause the packet that tells at what MTU to fragment to be dropped [correct me if I am wrong]. That then causes TCP packets to be retried endlessly until the connection fails.
However, it is unclear where this wrong checksum comes from, as in <random hash for the routing netns>:
And on the host machine:
There I set offloading to off (ethtool -K <iface> rx off tx off) because some people had issues with hardware checksums failing with wireguard (old): https://lists.zx2c4.com/pipermail/wireguard/2019-December/004801.html and other people had success fixing a similar overlay issue that way: https://github.com/moby/moby/issues/41775 https://stackoverflow.com/questions/66866779/routing-mesh-stop-working-in-docker-swarm
As I understand it, the packet goes to the vxlan, which then receives an ICMP reply reporting an MTU failure.
Setting the MTU of eth0 in <random hash for the routing netns> to be equal to the wg0 MTU entirely fixes the problem. This of course needs to be done on every machine for every network that goes over the vxlan.
Notes
The way I found <random hash for the routing netns> is by running
for f in /run/docker/netns/*; do echo "$f:"; nsenter --net="$f" ip a; done
and finding the IP range of the target service (10.0.0.0/24 in my case).
A /etc/docker/daemon.json solution of { "mtu": 1350 } seems to have no effect on swarm.
It seems the same problem was an issue in docker-compose a long time ago: https://mlohr.com/docker-mtu/
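The per-namespace workaround from the Actual Problem section (setting eth0 inside the routing namespace to the wg0 MTU) could be scripted roughly as follows. This is a dry-run sketch, not a tested fix: `print_mtu_fix` is a made-up helper that only prints the `nsenter` / `ip link` commands for review, the namespace path in the example is a placeholder (real names are random hashes), and it assumes eth0 is the overlay-facing interface in each namespace passed in.

```shell
#!/bin/sh
# print_mtu_fix <mtu> <netns_path>...   (hypothetical helper)
# Prints, but does not run, one command per namespace that would set
# the MTU of eth0 inside that namespace.
print_mtu_fix() {
    mtu=$1
    shift
    for ns in "$@"; do
        echo "nsenter --net=$ns ip link set dev eth0 mtu $mtu"
    done
}

# Example with a placeholder namespace path.
print_mtu_fix 1350 /run/docker/netns/EXAMPLE
```

Once the printed commands look right, they can be run as root on each machine. Changes made with `ip link` are not persisted, so they would presumably need to be reapplied whenever Docker recreates the namespace.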
General inquiry
It would be nice if docker / docker swarm could work out of the box with different MTUs (for example by checking the outbound network interfaces and using max(config value, min(interfaces' mtu))), or at least if the config value worked. There are too many other open issues like this one, and it is definitely non-trivial to figure out (if you don't know what you're looking for).
I really feel this should just work out of the box (just like all other internet infrastructure).
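For concreteness, the max(config value, min(interfaces' mtu)) selection suggested above can be sketched like this. It takes the formula literally as written; the function names `min_mtu` and `proposed_mtu` are invented for this example and do not correspond to anything in Docker.

```shell
#!/bin/sh
# min_mtu <mtu>...   (hypothetical helper)
# Prints the smallest of the given MTU values.
min_mtu() {
    min=$1
    shift
    for m in "$@"; do
        if [ "$m" -lt "$min" ]; then
            min=$m
        fi
    done
    echo "$min"
}

# proposed_mtu <config_mtu> <iface_mtu>...   (hypothetical helper)
# Prints max(config value, min(interfaces' mtu)), as written in the issue.
proposed_mtu() {
    config=$1
    shift
    lowest=$(min_mtu "$@")
    if [ "$config" -gt "$lowest" ]; then
        echo "$config"
    else
        echo "$lowest"
    fi
}

proposed_mtu 1350 1500 1350   # prints 1350
proposed_mtu 1200 1500 1350   # prints 1350
```

On a Linux host the interface values could be taken from `/sys/class/net/*/mtu`. Note that with the outer max, a config value below the smallest interface MTU is ignored, as the second call shows.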