moby / libnetwork

networking for containers

Overlay network with multiple subnets fails when new machines are added #1849

Open slugonamission opened 7 years ago

slugonamission commented 7 years ago

We've noticed issues when attaching multiple subnets to an overlay network: connectivity is sometimes lost when new machines are added to the pool. In that case, containers provisioned on the new machines cannot communicate with containers on the pre-existing machines if the two containers are on different subnets.

If we have an existing machine M1 with container CA on subnet 10.0.0.0/24, and add a new machine M2 with container CB on subnet 10.0.1.0/24, packets flow perfectly well from M1CA -> M2CB, but packets sometimes cannot transit the other way. If it "works", everything is fine (and we can freely restart M2CB with no issues), but if it doesn't, no amount of restarting M2CB changes anything. Restarting M1CA does work (presumably because the routing info is re-broadcast onto Serf), but that is a nasty workaround.
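For reference, this is roughly how we check the two directions (CA and CB are just the placeholder container names from the example above; in practice we ping by container name, as in the repro script further down):

# M1CA -> M2CB: always works for us
docker exec CA ping -c 3 CB
# M2CB -> M1CA: sometimes fails with "Destination Net Unreachable"
docker exec CB ping -c 3 CA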

After looking into this deeper, it seems that everything is fine inside the container. M2CB has a routing rule to route 10.0.0.0/24 via 10.0.1.1, and its ARP cache has an entry for 10.0.1.1. Outside the container, however, things are sometimes wrong: when it's "broken", the network namespace for the overlay network contains only a single bridge with IP 10.0.1.1 and a single VxLAN device, and has no routing rule for the 10.0.0.0/24 subnet, whereas when everything is "working" it contains two separate bridges, separate VxLAN devices, and all the appropriate routing table rules. It's as if joinSubnetSandbox is never called on M2 for the 10.0.0.0/24 subnet. Of course, when this happens, standard peerDb resolution over Serf cannot happen, since there is no network adapter for the 10.0.0.0/24 subnet, so ARP never takes place and the netlink listener is never notified.
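For anyone wanting to check this on their own hosts, the state can be inspected from outside the containers roughly as follows (1-<netid> is whatever name ip netns ls reports for the overlay network's namespace; the IDs will differ per host):

# List namespaces; the overlay network's one is named 1-<short network id>
ip netns ls
# A healthy namespace has one bridge and one vxlan device per subnet,
# plus a kernel route for every subnet on the network
ip netns exec 1-<netid> ip addr
ip netns exec 1-<netid> ip route
# The entries programmed from the peerDb can be checked with
ip netns exec 1-<netid> bridge fdb show
ip netns exec 1-<netid> ip neigh show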

I've got readouts from ip addr/ip route inside the VxLAN network namespace for both the working and the broken case, as well as debug logs from M2 in both situations, and a script to reproduce. In this setup, the overlay network has the IP ranges 192.168.100.0/24 and 192.168.101.0/24; M1CA has IP 192.168.101.2, and M2CB has an IP from the 192.168.100.0/24 range:

Working

root@test-2:/var/log# ip netns ls
c47887a8d514
1-e12484abfd
38a2eeb24909

root@test-2:/var/log# ip netns exec 1-e12484abfd ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether f6:19:6f:c8:c0:b3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.101.1/24 scope global br0
       valid_lft forever preferred_lft forever
3: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 5e:67:d2:66:d6:0d brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 scope global br1
       valid_lft forever preferred_lft forever
10: vxlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN group default
    link/ether f6:19:6f:c8:c0:b3 brd ff:ff:ff:ff:ff:ff
11: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br1 state UNKNOWN group default
    link/ether 5e:67:d2:66:d6:0d brd ff:ff:ff:ff:ff:ff
13: veth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br1 state UP group default
    link/ether b2:1f:72:98:89:e5 brd ff:ff:ff:ff:ff:ff

root@test-2:/var/log# ip netns exec 1-e12484abfd ip route
192.168.100.0/24 dev br1  proto kernel  scope link  src 192.168.100.1
192.168.101.0/24 dev br0  proto kernel  scope link  src 192.168.101.1

Broken

root@test-2:/var/run# ip netns ls
b5b3842ec2e6
1-d6b537bdbf
655692b7ab8c

root@test-2:/var/run# ip netns exec 1-d6b537bdbf ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 2e:2d:26:eb:d1:df brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.1/24 scope global br0
       valid_lft forever preferred_lft forever
10: vxlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN group default
    link/ether 2e:2d:26:eb:d1:df brd ff:ff:ff:ff:ff:ff
12: veth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default
    link/ether a2:8a:51:fa:cb:3d brd ff:ff:ff:ff:ff:ff

root@test-2:/var/run# ip netns exec 1-d6b537bdbf ip route
192.168.100.0/24 dev br0  proto kernel  scope link  src 192.168.100.1

Reproduction script

Sadly, this isn't a one-shot repro; you may have to destroy the VMs this creates and run it a few times. Eventually, the container running ping will report 92 bytes from 192.168.100.1: Destination Net Unreachable

#!/bin/bash
set -e
MACHINE_NAME=consul-host
MACHINE_ARGS="-s `pwd`/test_machines"

# Create the consul machine
echo "Provisioning Consul machine"
docker-machine ${MACHINE_ARGS} create -d virtualbox ${MACHINE_NAME}

eval $(docker-machine ${MACHINE_ARGS} env ${MACHINE_NAME})
CONSUL_IP=$(docker-machine ${MACHINE_ARGS} ip ${MACHINE_NAME})
export CONSUL=consul://${CONSUL_IP}:8500
export HOST_IP=$CONSUL_IP

# Bootstrap Consul
echo "Bootstrapping Consul"
docker run --name consul -d -p 8400:8400 -p 8500:8500 -p 8600:8600 progrium/consul -server -bootstrap

# Create a new machine for testing
MACHINE_NAME=test-1
echo "Provisioning first testing machine"
docker-machine ${MACHINE_ARGS} create -d virtualbox \
    --swarm-master --swarm \
    --engine-opt cluster-store=${CONSUL} --engine-opt cluster-advertise=eth1:2376 \
    --swarm-discovery ${CONSUL} \
    ${MACHINE_NAME}

eval $(docker-machine ${MACHINE_ARGS} env --swarm ${MACHINE_NAME})

# Should all be up and running
# Create a dummy network
echo "Creating test network"
docker network create -d overlay --subnet 192.168.100.0/24 --gateway 192.168.100.1 --gateway 192.168.101.1 --subnet 192.168.101.0/24 --ip-range 192.168.100.0/24 testnet

# Spin up a container we want to keep around to demonstrate
echo "Creating nginx container"
docker run --name nginx1 -d --net testnet --ip 192.168.101.2 nginx
docker run --name nginx2 -d --net testnet --ip 192.168.101.3 nginx
docker run --name nginx3 -d --net testnet --ip 192.168.101.4 nginx
docker run --name nginx4 -d --net testnet --ip 192.168.101.5 nginx

# Now, spin up and down the helloworld image a bunch of times :)
# This will cause the Serf event log to exceed 1024 entries.
# Let's do ~10 at a time
for i in `seq 0 1`; do
    echo "helloworld $i / 1"
    docker run --rm --name demo-container-1 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-2 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-3 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-4 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-5 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-6 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-7 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-8 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-9 --net testnet hello-world > /dev/null 2>&1 &
    docker run --rm --name demo-container-10 --net testnet hello-world > /dev/null 2>&1 &
    wait
done

# Now, create a new machine and join it to the swarm
MACHINE_NAME=test-2
echo "Provisioning second testing machine"
docker-machine ${MACHINE_ARGS} create -d virtualbox \
    --swarm \
    --engine-opt cluster-store=${CONSUL} --engine-opt cluster-advertise=eth1:2376 \
    --swarm-discovery ${CONSUL} \
    ${MACHINE_NAME}

sleep 10

# Now, bring up debian and try and ping the nginx container
# We're still in the swarm, just use a constraint here
echo "Running test ping"
docker run -t --name deb-test --net testnet -e "constraint:node==test-2" debian ping nginx1
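To tell quickly whether a run has actually hit the broken state, a rough check like this on test-2 works for us (it assumes only one overlay network exists on the host):

# Run as root on test-2: the overlay namespace should hold a route for
# BOTH subnets; if 192.168.101.0/24 is missing, the bug has reproduced.
NS=$(ip netns ls | grep '^1-' | head -n 1)
ip netns exec "$NS" ip route
ip netns exec "$NS" ip route | grep -q '192.168.101.0/24' \
    || echo "BROKEN: no route for 192.168.101.0/24"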

Logs: docker.FAIL.log.txt docker.WORKING.log.txt

slugonamission commented 7 years ago

Looking a little further into the logs, this appears to be the critical part, which is only present in the "working" version:

time="2017-07-24T09:05:30.554187540Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 a2fb12a41f49968973355396a52910ab7d0c4ae19c90cb7e01f673d566be72a6, payload:join 192.168.101.5 255.255.255.0 02:42:c0:a8:65:05 LTime:36 \n"
time="2017-07-24T09:05:30.554293146Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/a2fb12a41f49968973355396a52910ab7d0c4ae19c90cb7e01f673d566be72a6/192.168.99.101/192.168.101.5/255.255.255.0/02:42:c0:a8:65:05\n"
time="2017-07-24T09:05:30.555700694Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 ac0a01d43c22af222e83ae7463df0b77091fd550cc0edc454e482c99d810de22, payload:join 192.168.101.3 255.255.255.0 02:42:c0:a8:65:03 LTime:35 \n"
time="2017-07-24T09:05:30.555752698Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/ac0a01d43c22af222e83ae7463df0b77091fd550cc0edc454e482c99d810de22/192.168.99.101/192.168.101.3/255.255.255.0/02:42:c0:a8:65:03\n"
time="2017-07-24T09:05:30.555773868Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 58a73b83564760a433f51e91311c9251003b1859d2197b8337535be1e0295110, payload:join 192.168.101.4 255.255.255.0 02:42:c0:a8:65:04 LTime:34 \n"
time="2017-07-24T09:05:30.555798635Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/58a73b83564760a433f51e91311c9251003b1859d2197b8337535be1e0295110/192.168.99.101/192.168.101.4/255.255.255.0/02:42:c0:a8:65:04\n"
time="2017-07-24T09:05:30.555825310Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 0652b757a006ae4589303b859cf45f8ba3d625aa82057ab01b101a54e0d0348f, payload:join 192.168.101.2 255.255.255.0 02:42:c0:a8:65:02 LTime:33 \n"
time="2017-07-24T09:05:30.555848429Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/0652b757a006ae4589303b859cf45f8ba3d625aa82057ab01b101a54e0d0348f/192.168.99.101/192.168.101.2/255.255.255.0/02:42:c0:a8:65:02\n"

In this case, it appears M1 never dumps its peerDb when M2 joins the cluster. The likely reason shows up in M1's logs:

time="2017-07-24T09:39:16.157085475Z" level=debug msg="2017/07/24 09:39:16 [DEBUG] memberlist: Failed to join 192.168.99.102: dial tcp 192.168.99.102:7946: getsockopt: connection refused\n"
time="2017-07-24T09:39:16.157133964Z" level=error msg="joining serf neighbor 192.168.99.102 failed: Failed to join the cluster at neigh IP 192.168.99.102: 1 error(s) occurred:\n\n* Failed to join 192.168.99.102: dial tcp 192.168.99.102:7946: getsockopt: connection refused"
time="2017-07-24T09:39:16.158310455Z" level=debug msg="2017/07/24 09:39:16 [DEBUG] memberlist: Stream connection from=192.168.99.102:44528\n"
time="2017-07-24T09:39:16.158975080Z" level=info msg="2017/07/24 09:39:16 [INFO] serf: EventMemberJoin: test-2 192.168.99.102\n"

I assume this is a race between the node being added in Consul and Serf actually being available on the node?

slugonamission commented 7 years ago

We've been running into this more often recently. Because of this race, if a single-node Docker deployment is already running containers and another node joins, the new node sometimes does not properly join Serf, because the existing node tries to probe it before its Serf agent has fully started. The new node can then resolve remote hostnames, since the K/V store is working, but it cannot communicate with containers on any remote node because Serf is broken.
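At the moment the only reliable way we have of spotting that the race has hit is to look for the join failure shown earlier in the daemon log of a pre-existing node (the log location depends on the host; the boot2docker VMs from the repro script write it to /var/log/docker.log):

# On a pre-existing node (e.g. test-1), after a new node has joined.
# A match means this node failed to join the new node's Serf agent,
# so its peerDb will never be pushed to the new node.
grep "joining serf neighbor" /var/log/docker.log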