bloo opened 6 years ago
I'm going to keep my cluster as-is (partially broken) to assist in triage and inspection. Thanks!
Running containers (with network diag utilities) on each node, attached to the same overlay network as the hello service, then exec'ing into each of those containers, gives me no issues. I suspect the problem exists somewhere between dockerd on nodeA and the ingress overlay network.
At Swarm initialization, we're re-creating the ingress
network to work around an already used subnet on our network:
yes | docker network rm ingress
docker network create --driver overlay --ingress \
--subnet 10.100.0.0/16 \
--gateway 10.100.0.1 \
ingress-hello
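For context on why we replace the default ingress subnet: the clash can be sanity-checked offline with Python's ipaddress module. The CIDRs below are illustrative; only 10.100.0.0/16 comes from the commands above, and the "in use" range stands in for whatever corporate range collided with the default.

```python
# Offline sanity check: does a proposed ingress subnet overlap an
# existing range on our network? (Only 10.100.0.0/16 is from the
# commands above; the other CIDRs are hypothetical stand-ins.)
from ipaddress import ip_network

def overlaps(a: str, b: str) -> bool:
    """True if the two CIDR blocks share any addresses."""
    return ip_network(a).overlaps(ip_network(b))

# Hypothetical already-in-use range that forced a custom subnet:
in_use = "10.0.0.0/16"

print(overlaps("10.0.0.0/24", in_use))    # a default-style subnet clashes
print(overlaps("10.100.0.0/16", in_use))  # our replacement does not
```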
When running docker inspect ingress-hello on each node, our ingress-hello-sbox sandbox entries show these settings, which don't look right (nodeA and nodeC claim the same MAC and IPv4 address):
nodeA
"ingress-hello-sbox": {
"Name": "ingress-hello-endpoint",
"EndpointID": "fce5334d6dcedbb50a66d99b70595b5c93ca4cb81674703214ba1fb365051dd3",
"MacAddress": "02:42:0a:64:00:0b",
"IPv4Address": "10.100.0.11/16",
"IPv6Address": ""
}
nodeB
"ingress-hello-sbox": {
"Name": "ingress-hello-endpoint",
"EndpointID": "2fdae6cb1934f9a4a9eb130f9b13bff5501668d625bb201646cf2abb4b20063d",
"MacAddress": "02:42:0a:64:00:02",
"IPv4Address": "10.100.0.2/16",
"IPv6Address": ""
}
nodeC
"ingress-hello-sbox": {
"Name": "ingress-hello-endpoint",
"EndpointID": "f3fdfb3bc19c1257d3549cbdb262953357003cfda0af7372a6ce722cf75ff44a",
"MacAddress": "02:42:0a:64:00:0b",
"IPv4Address": "10.100.0.11/16",
"IPv6Address": ""
}
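A small script makes the anomaly explicit: the same sandbox address should never be claimed by two nodes on one overlay network. This is a minimal sketch using the endpoint values copied from the inspect output above; the conflicts helper is ours, not a Docker API.

```python
# Cross-node consistency check: flag any IPv4 address claimed by more
# than one node's ingress sandbox. Endpoint data copied from the
# docker inspect output above.
from collections import defaultdict

endpoints = {
    "nodeA": {"MacAddress": "02:42:0a:64:00:0b", "IPv4Address": "10.100.0.11/16"},
    "nodeB": {"MacAddress": "02:42:0a:64:00:02", "IPv4Address": "10.100.0.2/16"},
    "nodeC": {"MacAddress": "02:42:0a:64:00:0b", "IPv4Address": "10.100.0.11/16"},
}

def conflicts(eps):
    """Return {address: [nodes]} for any IPv4 claimed by more than one node."""
    claims = defaultdict(list)
    for node, ep in eps.items():
        claims[ep["IPv4Address"]].append(node)
    return {addr: nodes for addr, nodes in claims.items() if len(nodes) > 1}

print(conflicts(endpoints))  # {'10.100.0.11/16': ['nodeA', 'nodeC']}
```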
Similarly, when we docker inspect docker_gwbridge on each node (note that nodeA and nodeB show the same MAC and IPv4 address here, too):
nodeA
"ingress-hello-sbox": {
"Name": "gateway_ingress-hell",
"EndpointID": "8e58e338c0a04840ad2607c57f73b6b76a529a5c11da03efe6c1392e24f92d02",
"MacAddress": "02:42:ac:12:00:04",
"IPv4Address": "172.18.0.4/16",
"IPv6Address": ""
}
nodeB
"ingress-hello-sbox": {
"Name": "gateway_ingress-hell",
"EndpointID": "23f2377320af58680fb29e4b73789d4e6add7068fba70b135a5056acf8296c6f",
"MacAddress": "02:42:ac:12:00:04",
"IPv4Address": "172.18.0.4/16",
"IPv6Address": ""
}
... and this is the container entry of our hello service task on docker_gwbridge:
"cc073a84ce16d81645ff3255b19ee0d1cff09bc7ced92c596e53467f4e77c732": {
"Name": "gateway_cc073a84ce16",
"EndpointID": "af4730ae76f7c54b5c59bb33ae7672b84b7b8b2a95881007f28ef7554a17e1ae",
"MacAddress": "02:42:ac:12:00:07",
"IPv4Address": "172.18.0.7/16",
"IPv6Address": ""
},
nodeC
"ingress-hello-sbox": {
"Name": "gateway_ingress-hell",
"EndpointID": "a1c937fb8c2e0ce83d90fec9d3c1a4f6ea49e3769b8f6f59d44cb52c691c37ff",
"MacAddress": "02:42:ac:12:00:02",
"IPv4Address": "172.18.0.2/16",
"IPv6Address": ""
}
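One pattern visible in all the dumps above: each MAC appears to be just 02:42 followed by the hex bytes of the interface's IPv4 address, so a duplicate MAC is the same symptom as a duplicate IP, not a second independent anomaly. A quick sketch of that observation (the derivation rule is inferred from our data, not quoted from Docker docs):

```python
# Observation from the inspect dumps above: each MAC looks like 02:42
# followed by the hex bytes of the IPv4 address, so duplicate MACs and
# duplicate IPs are one and the same symptom.
def mac_for(ipv4: str) -> str:
    """Derive the 02:42-prefixed MAC Docker appears to assign for an IPv4."""
    octets = [int(o) for o in ipv4.split(".")]
    return ":".join(["02", "42"] + [f"{o:02x}" for o in octets])

print(mac_for("172.18.0.4"))   # 02:42:ac:12:00:04  (nodeA and nodeB!)
print(mac_for("10.100.0.11"))  # 02:42:0a:64:00:0b  (nodeA and nodeC!)
```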
Any word on what could be causing these issues?
Description
There's some Docker networking routing failure that happens under a very specific combination of circumstances. We have Swarm clusters that, over time (not sure yet whether it's a CoreOS upgrade, a random reboot, or the moon phase), lose the ability to route requests from one node (an EC2 host) to a container on another specific node. When there are multiple containers running across multiple hosts, requests across that same specific node combination always fail, while all other combinations work.
Steps to reproduce the issue:

1. Create a 3-node Swarm cluster: nodeA, nodeB, nodeC
2. Deploy a hello service with replicas=1, publishing port 3333 (bringing the Swarm ingress network to the node that runs the container)
3. Confirm the DOCKER-INGRESS rule exists for 3333 using sudo iptables -L
4. curl localhost:3333 from each node
5. Run docker service update --force hello to move the task/container around between nodes

Describe the results you received:
If the only hello task is running on nodeA or nodeC:
- curl localhost:3333 works from every node

If the only hello task is running on nodeB:
- curl localhost:3333 from nodeA times out 100% of the time
- curl localhost:3333 from nodeB or nodeC works

If we bump replicas=2:
- with hello running on nodeA and nodeC, curl localhost:3333 works from every node
- with hello running on nodeA|C and nodeB:
  - curl localhost:3333 from nodeA works 50% of the time
  - curl localhost:3333 from nodeB or nodeC works

Rebooting nodeA doesn't help. Terminating nodeA and having our AWS ASG recreate it fixes the issue (but we won't do that just yet).

Describe the results you expected:

Requests to nodeA's listening port 3333, when routed to that service's container/task on nodeB, should always work.

Additional information you deem important (e.g. issue happens only occasionally):
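For what it's worth, the 50% figure with replicas=2 is exactly what strict round-robin over the ingress VIP would produce when one of the two backend paths is dead. A toy simulation of that reasoning (the round-robin discipline and the task names are assumptions on our part, not observed Docker internals):

```python
# Toy model of the 50% failure: with two tasks behind the ingress VIP
# and strict round-robin balancing (assumed), every other request from
# nodeA lands on the task whose path is broken (the one on nodeB).
from itertools import cycle

def success_rate(backends, broken, requests=100):
    """Fraction of requests that succeed under strict round-robin."""
    rr = cycle(backends)
    ok = sum(1 for _ in range(requests) if next(rr) not in broken)
    return ok / requests

print(success_rate(["task@nodeA"], broken={"task@nodeB"}))                # 1.0
print(success_rate(["task@nodeA", "task@nodeB"], broken={"task@nodeB"}))  # 0.5
```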
Output of docker version:

All 3 nodes:

Output of docker info:

nodeA

nodeB

nodeC
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS across 3 AZs using CoreOS Container Linux AMIs and identical Launch Configurations.