Open slugonamission opened 7 years ago
Looking a little further into the logs, this appears to be the critical part, which is only present in the "working" version:
time="2017-07-24T09:05:30.554187540Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 a2fb12a41f49968973355396a52910ab7d0c4ae19c90cb7e01f673d566be72a6, payload:join 192.168.101.5 255.255.255.0 02:42:c0:a8:65:05 LTime:36 \n"
time="2017-07-24T09:05:30.554293146Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/a2fb12a41f49968973355396a52910ab7d0c4ae19c90cb7e01f673d566be72a6/192.168.99.101/192.168.101.5/255.255.255.0/02:42:c0:a8:65:05\n"
time="2017-07-24T09:05:30.555700694Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 ac0a01d43c22af222e83ae7463df0b77091fd550cc0edc454e482c99d810de22, payload:join 192.168.101.3 255.255.255.0 02:42:c0:a8:65:03 LTime:35 \n"
time="2017-07-24T09:05:30.555752698Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/ac0a01d43c22af222e83ae7463df0b77091fd550cc0edc454e482c99d810de22/192.168.99.101/192.168.101.3/255.255.255.0/02:42:c0:a8:65:03\n"
time="2017-07-24T09:05:30.555773868Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 58a73b83564760a433f51e91311c9251003b1859d2197b8337535be1e0295110, payload:join 192.168.101.4 255.255.255.0 02:42:c0:a8:65:04 LTime:34 \n"
time="2017-07-24T09:05:30.555798635Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/58a73b83564760a433f51e91311c9251003b1859d2197b8337535be1e0295110/192.168.99.101/192.168.101.4/255.255.255.0/02:42:c0:a8:65:04\n"
time="2017-07-24T09:05:30.555825310Z" level=debug msg="Received user event name:jl 192.168.99.101 e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849 0652b757a006ae4589303b859cf45f8ba3d625aa82057ab01b101a54e0d0348f, payload:join 192.168.101.2 255.255.255.0 02:42:c0:a8:65:02 LTime:33 \n"
time="2017-07-24T09:05:30.555848429Z" level=debug msg="Parsed data = e12484abfdc4467a43bcc9abe03f5cff2a450f72ff08c652db44d06511153849/0652b757a006ae4589303b859cf45f8ba3d625aa82057ab01b101a54e0d0348f/192.168.99.101/192.168.101.2/255.255.255.0/02:42:c0:a8:65:02\n"
In this case, it appears that M1 is not dumping its peerDb when M2 joins the cluster. M1's logs also show a related failure:
time="2017-07-24T09:39:16.157085475Z" level=debug msg="2017/07/24 09:39:16 [DEBUG] memberlist: Failed to join 192.168.99.102: dial tcp 192.168.99.102:7946: getsockopt: connection refused\n"
time="2017-07-24T09:39:16.157133964Z" level=error msg="joining serf neighbor 192.168.99.102 failed: Failed to join the cluster at neigh IP 192.168.99.102: 1 error(s) occurred:\n\n* Failed to join 192.168.99.102: dial tcp 192.168.99.102:7946: getsockopt: connection refused"
time="2017-07-24T09:39:16.158310455Z" level=debug msg="2017/07/24 09:39:16 [DEBUG] memberlist: Stream connection from=192.168.99.102:44528\n"
time="2017-07-24T09:39:16.158975080Z" level=info msg="2017/07/24 09:39:16 [INFO] serf: EventMemberJoin: test-2 192.168.99.102\n"
I assume this is a race between the node being added in Consul and Serf actually being available on the node?
We've actually been running into this more recently. Because of this race, if a single-node Docker deployment is already running containers and another node joins, the new node sometimes does not properly join Serf, because the existing node attempts to probe it before its Serf agent has fully started. The new node can then still resolve remote hostnames, since the K/V store is working, but it cannot communicate with any foreign nodes because Serf is broken.
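If that's right, one possible mitigation would be to retry the neighbour join with a backoff instead of giving up after the first "connection refused". A minimal sketch, assuming hashicorp/serf's Join API; joinWithRetry is a hypothetical helper, not anything that exists in libnetwork today:

```go
// Sketch only: keep retrying the serf join, since the peer may appear in the
// K/V store before its serf agent is actually listening on 7946.
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/serf/serf"
)

// joinWithRetry tries to join neighbour until it succeeds or the attempts are
// exhausted. The 1s/2s/4s/... backoff is arbitrary.
func joinWithRetry(s *serf.Serf, neighbour string, attempts int) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if _, err = s.Join([]string{neighbour}, true); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("failed to join %s after %d attempts: %v", neighbour, attempts, err)
}
```

Even a couple of retries would cover the window between the node being registered in the K/V store and its Serf agent actually accepting connections.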
We've noticed issues when attaching multiple subnets to an overlay network: connectivity is sometimes lost when new machines are added to the pool. When this happens, any containers provisioned on the new machines cannot communicate with containers on the pre-existing machines if the containers are on different subnets.
If we have an existing machine M1 with container CA on subnet 10.0.0.0/24, and add a new machine M2 with container CB on subnet 10.0.1.0/24, packets can flow perfectly fine from M1CA -> M2CB, but packets sometimes cannot transit the other way. If it "works", everything is fine (and we can freely restart M2CB with no issues), but if it doesn't work, no amount of restarting M2CB will change anything. Restarting M1CA does work (presumably because the routing info is re-broadcast onto Serf), but that is a nasty workaround.

After looking into this deeper, it seems that everything is fine inside the container. M2CB has a routing rule to route 10.0.0.0/24 via 10.0.1.1, and the ARP cache has an entry for 10.0.1.1. Outside the container, things are sometimes broken, though: when it's "broken", the network namespace for the overlay network contains only a single bridge with IP 10.0.1.1 and a single VxLAN device, and has no routing rules for the 10.0.0.0/24 subnet, whereas when everything is "working" it contains two separate bridges, all appropriate routing table rules, and separate VxLAN devices. It's as if joinSubnetSandbox is never called on M2 for the 10.0.0.0/24 subnet. Of course, when this happens, standard peerDb resolution over Serf cannot happen: there is no network interface for the 10.0.0.0/24 subnet, so ARP does not take place and the netlink listener never gets notified.
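For reference, here's roughly how I've been dumping the state of that namespace programmatically. This is a diagnostic sketch only; the namespace path under /var/run/docker/netns/ and the use of the vishvananda/netlink and netns libraries are my own assumptions, not anything taken from libnetwork's code:

```go
// Diagnostic sketch: enter the overlay network's namespace and list its links
// and IPv4 routes, to check whether the second subnet's bridge/VXLAN device
// and route are actually present.
package main

import (
	"fmt"
	"os"
	"runtime"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

func main() {
	nsPath := os.Args[1] // e.g. a file under /var/run/docker/netns/ (assumption)

	// Namespace switches apply per OS thread, so pin the goroutine first.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	origin, err := netns.Get()
	if err != nil {
		panic(err)
	}
	defer origin.Close()
	defer netns.Set(origin)

	handle, err := netns.GetFromPath(nsPath)
	if err != nil {
		panic(err)
	}
	defer handle.Close()
	if err := netns.Set(handle); err != nil {
		panic(err)
	}

	// "Working": one bridge + one vxlan device per subnet, plus routes for both
	// subnets. "Broken": the second subnet's devices and route are missing.
	links, _ := netlink.LinkList()
	for _, l := range links {
		fmt.Printf("link: %-12s type=%s\n", l.Attrs().Name, l.Type())
	}
	routes, _ := netlink.RouteList(nil, netlink.FAMILY_V4)
	for _, r := range routes {
		fmt.Printf("route: %+v\n", r)
	}
}
```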
I've got readouts from ip addr / ip route inside the VxLAN network namespace for when it's working and broken, as well as debug logs from M2 in both situations and a script to reproduce. In this situation, the overlay network has the IP ranges 192.168.100.0/24 and 192.168.101.0/24. M1CA has IP 192.168.101.2, and M2CB has an IP from the 192.168.100.0/24 range:

Working
Broken
Reproduction script
Sadly, this isn't a one-shot repro; you may have to destroy the VMs it creates and run it a few times. Eventually, the container running ping will report 92 bytes from 192.168.100.1: Destination Net Unreachable
Logs: docker.FAIL.log.txt docker.WORKING.log.txt