paulcadman opened 8 years ago
I'm seeing the same thing: docker/docker#22144
Hit this issue as well; restarting the container seems to fix it, but that isn't ideal.
We're using:
Client:
Version: 1.11.0
API version: 1.23
Go version: go1.5.4
Git commit: 4dc5990
Built: Wed Apr 13 19:36:04 2016
OS/Arch: linux/amd64
Server:
Version: swarm/1.2.3
API version: 1.22
Go version: go1.5.4
Git commit: eaa53c7
Built: Fri May 27 17:25:03 UTC 2016
OS/Arch: linux/amd64
I have the same problem. Just commenting to see if there is any update on this.
My env:
Client:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 22:00:43 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 22:00:43 2016
OS/Arch: linux/amd64
Another me too. I can add that, for me, it is sufficient to ping any other container on the same network from the container that cannot connect to fix the issue; the first ping may take slightly longer, but then everything starts to work as intended. I am using Zookeeper.
Some weirdness with ARP? Perhaps it relates to this: https://github.com/docker/libnetwork/issues/338 ?
No, ARP seems to be fine at the container level. I have a "blocked" container in front of me and arp -a reports the correct MAC address for the server container it is trying to connect to. So there is something wrong at the IP level.
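For reference, the container-level check described above can be reproduced with something like the following; "blocked-container" is a placeholder name, and the tools must be present inside the image:
# check the neighbor table from inside the container that cannot connect
docker exec blocked-container arp -a
# iproute2 alternative if arp (net-tools) is not installed in the image
docker exec blocked-container ip neigh show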
I can reproduce it quite often by running Spark clusters. I would say that one cluster out of five comes out bad, with some worker unable to talk to the master. Since we do a lot of automated tests and performance measurements, we get many outliers, which basically makes Swarm with overlay networks too unreliable.
There is something fishy in the overlay network namespace. Here is the vxlan fdb on the host on which the "blocked" container is running:
$ sudo nsenter --net=/var/run/docker/netns/1-07e76e17d6 bridge fdb show dev vxlan1
6e:ed:5e:b7:60:b1 vlan 1 permanent
6e:ed:5e:b7:60:b1 permanent
02:42:0a:00:04:11
33:33:00:00:00:01 self permanent
01:00:5e:00:00:01 self permanent
33:33:ff:b7:60:b1 self permanent
02:42:0a:00:04:10 dst 192.168.47.12 self permanent
02:42:0a:00:04:03 dst 192.168.47.19 self permanent
02:42:0a:00:04:0f dst 192.168.47.12 self permanent
02:42:0a:00:04:0e dst 192.168.47.20 self permanent
02:42:0a:00:04:0d dst 192.168.47.20 self permanent
02:42:0a:00:04:0c dst 192.168.47.13 self permanent
02:42:0a:00:04:0b dst 192.168.47.17 self permanent
02:42:0a:00:04:09 dst 192.168.47.15 self permanent
02:42:0a:00:04:08 dst 192.168.47.16 self permanent
02:42:0a:00:04:07 dst 192.168.47.16 self permanent
02:42:0a:00:04:13 dst 192.168.47.21 self permanent
02:42:0a:00:04:06 dst 192.168.47.21 self permanent
02:42:0a:00:04:12 dst 192.168.47.18 self permanent
02:42:0a:00:04:05 dst 192.168.47.18 self permanent
02:42:0a:00:04:11 is the MAC address of the container that the "blocked" container cannot talk to. The other MAC addresses have a destination IP address, but that one does not.
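For anyone wanting to run the same inspection, a minimal sketch; the namespace ID 1-07e76e17d6 is specific to this host, so list /var/run/docker/netns to find yours:
# list the overlay network namespaces created by Docker on this host
sudo ls /var/run/docker/netns/
# show the vxlan forwarding table and the neighbor (ARP) table inside the overlay namespace
sudo nsenter --net=/var/run/docker/netns/1-07e76e17d6 bridge fdb show dev vxlan1
sudo nsenter --net=/var/run/docker/netns/1-07e76e17d6 ip neigh show
# a healthy remote entry should carry a "dst <remote host IP>" pointing at the peer node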
Any idea, @mavenugo ?
Thanks @dvenza. I will page @mrjana to see if he has any ideas.
I'm also getting this issue. Unfortunately I know virtually nothing about networking. My setup is pretty much identical to @paulcadman's.
I've actually had the issue for some time over the last few Docker releases. It got so bad at one point that I recreated the whole overlay network and all containers, and it was good for a few weeks, but it's happening again :(
If there's anything I can debug, let me know. We're running on EC2, if that helps.
Client:
Version: 1.11.1
API version: 1.23
Go version: go1.5.4
Git commit: 5604cbe
Built: Wed Apr 27 00:34:42 2016
OS/Arch: linux/amd64
Server:
Version: swarm/1.2.0
API version: 1.22
Go version: go1.5.4
Git commit: a6c1f14
Built: Wed Apr 13 05:58:31 UTC 2016
OS/Arch: linux/amd64
Containers: 33
Running: 26
Paused: 0
Stopped: 7
Images: 45
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
swarm-node-eu-west1-1a-2: 10.0.1.193:2376
└ Status: Healthy
└ Containers: 10
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 1.322 GiB / 3.624 GiB
└ Labels: executiondriver=, kernelversion=3.10.0-327.13.1.el7.x86_64, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
└ Error: (none)
└ UpdatedAt: 2016-08-11T08:13:38Z
└ ServerVersion: 1.11.1
swarm-node-eu-west1-1b-2: 10.0.0.150:2376
└ Status: Healthy
└ Containers: 12
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 1.322 GiB / 3.624 GiB
└ Labels: executiondriver=, kernelversion=3.10.0-327.10.1.el7.x86_64, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
└ Error: (none)
└ UpdatedAt: 2016-08-11T08:13:30Z
└ ServerVersion: 1.11.1
swarm-node-eu-west1-1c-2: 10.0.2.227:2376
└ Status: Healthy
└ Containers: 11
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 330 MiB / 3.624 GiB
└ Labels: executiondriver=, kernelversion=3.10.0-327.10.1.el7.x86_64, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
└ Error: (none)
└ UpdatedAt: 2016-08-11T08:13:43Z
└ ServerVersion: 1.11.1
Plugins:
Volume:
Network:
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: linux
Architecture: amd64
CPUs: 6
Total Memory: 10.87 GiB
Name: fcf67f74802a
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support
Linux support-eu-west1-1a-1 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
On a new deployment with 25 machines this problem blocks network communication for about half of the containers that I create, making the setup completely unusable.
Any news?
I am facing the same issue. I think the overlay network is unreliable on Ubuntu.
Server Version: swarm/1.2.3
Client:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 21:47:50 2016
OS/Arch: linux/amd64
Server:
Version: 1.11.2
API version: 1.23
Go version: go1.5.4
Git commit: b9f10c9
Built: Wed Jun 1 21:47:50 2016
OS/Arch: linux/amd64
We have the same issue in a Docker Swarm with thousands of containers. On all the Docker nodes (1.10.3) we have an nginx container which can communicate with different application containers on the overlay network (using Consul). Sometimes one of the nginx containers cannot connect to an app container on a different node, and we get the same message in the log:
ERRO[5758943] could not resolve peer "<nil>": timed out resolving peer by querying the cluster
Additionally, we are seeing that the app containers which fail always have the same IP addresses. Restarting the app container doesn't work if the container receives the same IP address again. What works for us is:
We tried to debug where the packets are going, without success. There is no activity on the remote node, so we think the packet is not leaving the node where the nginx container is running.
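A rough sketch of how this kind of check can be done; eth0 stands in for whichever interface carries the underlay traffic on your hosts:
# on the node running the nginx container: watch for outgoing VXLAN traffic
sudo tcpdump -ni eth0 udp port 4789
# on the node running the app container: check whether anything arrives
sudo tcpdump -ni eth0 udp port 4789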
We have the exact same issue, with a different environment:
As mentioned above, we have been working around this issue by periodically running ping in the opposite direction, from the container that cannot be reached. In our case an nginx container often can't connect to a backend container to proxy HTTP requests. All of these containers are on the same frontend network. We set up a Jenkins job that runs every 5 minutes (it could be a cron job too) that finds all of the containers on the frontend network and execs into them to ping the nginx container:
export DOCKER_HOST=swarm-manager:3375
CONTAINERS=$(docker network inspect frontend | grep "\"Name\":" | grep -v "\"frontend\"" | grep -v "\"nginx\"" | sort | cut -d "\"" -f 4)
for CONTAINER in $CONTAINERS; do
  docker exec -i ${CONTAINER} ping nginx -c 1
done
This seems to keep the VXLAN working (and/or resolve the issue when it does happen) without having to recreate containers or restart anything.
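For anyone without Jenkins, the same loop can be driven by cron; a sketch assuming the commands above are saved as /usr/local/bin/ping-frontend.sh (a hypothetical path):
# run the workaround every 5 minutes and keep a log of the pings
*/5 * * * * /usr/local/bin/ping-frontend.sh >> /var/log/ping-frontend.log 2>&1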
While we have indeed managed to get the network working again by doing this (pinging in the opposite direction) before, we have also encountered a case where we could not avoid rebooting the machine. Docker Swarm 1.2.6 with docker-ce 17.03.1 is just not production-ready (e.g. 5 machines KO in a cluster of 8).
@tfasz @antoinetran this behavior was fixed by https://github.com/docker/libnetwork/pull/1792; the fix will be available in 17.06 and will also be backported to 17.03.
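A quick way to confirm that every node is running a daemon that includes the fix (17.06+, or a 17.03 release with the backport):
# print only the server (daemon) version; run this on each node, or through the Swarm manager
docker version --format '{{.Server.Version}}'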
Dear all,
We migrated 3 Swarm clusters to docker-ce-17.06.0 two weeks ago, and it seemed to work fine until now: I just reproduced this error once. I had to ping back to get connectivity again, but it seems this error is rarer now.
Is there any info/log someone wants?
@antoinetran To make sure we understand the issue clearly, can you provide details on exactly what the connectivity issue is and what the trigger is?
Environment: in a Swarm (classic mode 1.2.6 with docker-ce-17.06.0-ce, CentOS 7.3.1611), we have multiple Swarm nodes and multiple overlay networks. Containers are attached to one or more of these networks and launched with docker-compose against Swarm.
Connectivity issue: from a container C1 in one VM, I cannot reach (ping/nc) a container C2 in another VM by its service name. The ping hangs. If, in another window, I ping from container C2 to C1, then immediately afterwards the ping C1->C2 works.
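A minimal sketch of that sequence; c1 and c2 are hypothetical container names running on two different Swarm nodes:
docker exec c1 ping -c 3 c2   # hangs / times out when the issue occurs
docker exec c2 ping -c 3 c1   # ping in the opposite direction
docker exec c1 ping -c 3 c2   # immediately afterwards this works again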
What the trigger is: I have no idea. We are in an integration/validation process. Every 20 minutes we do a docker-compose down of the business containers, change the version, then do docker-compose up again. The only abnormal symptom I can see is that one of our business containers is slower to start.
@fcrisciani I have reproduced this issue a small number of times. Here is my new environment: all latest CentOS 7.4.1708 / docker-ce-17.12.1 / Swarm image 1.2.8.
Same symptom as my last post.
@antoinetran so the weird part is that the ping to the service name is answered locally; there is an iptables rule that replies to pings to the VIPs directly from the loopback. Could it simply be that C1 is really slow in coming up?
Could it simply be that C1 is really slow in coming up?
I don't know what you mean by slow, but if the ping does not work I wait maybe a few seconds to be sure.
so the weird part is that the ping to the service name is answered locally; there is an iptables rule that replies to pings to the VIPs directly from the loopback.
To be more precise, the DNS resolution always works. It is really the ping that does not work (no pong).
@antoinetran I was thinking that C1 was still not ready, but if you can exec into it, that should not be the case. Do you see anything unusual in the daemon logs or in the kernel logs?
Hard to say; the kernel/daemon logs are the first thing I look at. There are recurrent errors that do not seem to be real errors:
Mar 5 03:09:24 dev-node-003 dockerd: time="2018-03-05T03:09:24.046536158Z" level=error msg="Failed to deserialize netlink ndmsg: Link not found"
Mar 08 18:54:38 dev-node-003.ts-l2pf.cloud-omc.org dockerd[1126]: time="2018-03-08T18:54:38.295062659Z" level=error msg="could not resolve peer \"192.168.101.49\": timed out resolving peer by querying the cluster"
Mar 09 17:22:29 dev-node-006.ts-l2pf.cloud-omc.org dockerd[1310]: time="2018-03-09T17:22:29.442630899Z" level=error msg="2018/03/09 17:22:29 [ERR] memberlist: Failed fallback ping: read tcp 10.0.1.26:50300->10.0.1.29:7946: i/o timeout\n"
Mar 09 17:22:29 dev-node-006.ts-l2pf.cloud-omc.org dockerd[1310]: time="2018-03-09T17:22:29.716494Z" level=info msg="2018/03/09 17:22:29 [INFO] memberlist: Suspect dev-node-009.ts-l2pf.cloud-omc.org has failed, no acks received\n"
I will try to archive these logs when the event happens. Right now I don't remember exactly when it happened.
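A hypothetical snapshot command for when it happens again, so the daemon and kernel logs around the event are preserved:
# dump the docker daemon logs from the last hour plus the kernel ring buffer
journalctl -u docker --since "1 hour ago" > /tmp/docker-$(hostname)-$(date +%s).log
dmesg -T > /tmp/dmesg-$(hostname)-$(date +%s).log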
@antoinetran I see memberlist complaining about one node and marking it as suspect; this means that the physical network is failing to deliver the health checks. If the networkdb is not able to communicate with the other nodes, that would explain why the ping fails: the configuration is not being propagated, so C1 is not aware of C2.
@fcrisciani Ok! Thank you for the diagnosis. This event is probably due to an IP collision we had today (from the docker default network when we do compose up). That explains a lot. It might also be the cause of the network loss described in my earlier posts on the old environments.
@antoinetran in 17.12 I added several logs that you can use to check the status of the NetworkDB when you see weird behavior. The first is NetworkDB stats; it will print stats per network, like:
Feb 23 05:38:06 ip-172-31-19-13 dockerd[8276]: time="2018-02-23T05:38:06.269354743Z" level=info msg="NetworkDB stats ip-172-31-19-13(566e4c57c00e) - netID:3r6rqkvee3l4c7dx3c9fmf2a8 leaving:false netPeers:3 entries:12 Queue qLen:0 netMsg/s:0"
These are printed every 5 minutes, one line per network. This example tells you that the specific network has 3 nodes with containers deployed on it. Also, if there is a connectivity issue, you will see a line mentioning healthscore; this is a number coming directly from memberlist, and the higher the value, the more severe the problem. If you see healthscore mentioned, it's a good indicator that the underlay network may be acting up, so it's a good idea to start debugging from the bottom.
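A small sketch of how to pull these lines out of the daemon logs on a node under suspicion:
# the periodic per-network NetworkDB stats lines
journalctl -u docker | grep "NetworkDB stats"
# any memberlist health degradation reported by the daemon
journalctl -u docker | grep -i "healthscore"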
Same issue with 17.12.1
Make sure your firewall is open on the ports needed for overlay networks (see the sketch after the list below): https://docs.docker.com/network/overlay/#publish-ports-on-an-overlay-network
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
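On the CentOS hosts mentioned earlier this would typically be done with firewalld; a sketch, assuming firewalld is the active firewall:
sudo firewall-cmd --permanent --add-port=2377/tcp                      # cluster management
sudo firewall-cmd --permanent --add-port=7946/tcp --add-port=7946/udp  # node-to-node gossip
sudo firewall-cmd --permanent --add-port=4789/udp                      # VXLAN overlay traffic
sudo firewall-cmd --reload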
Description of problem:
Very rarely (observed twice after using 1000s of containers), we start a new container in an overlay network in a Docker Swarm and existing containers in the overlay network that are on different nodes cannot connect to the new container. However, containers in the overlay network on the same node as the new container are able to connect.
The new container receives an IP address in the overlay network subnet, but this does not seem to work correctly when resolved from a different node.
The second time this happened we fixed the problem by stopping and starting the new container.
We haven't found a way to reliably reproduce this problem. Is there any other debugging I can provide that would help diagnose this issue?
The error message is the same as the one reported on https://github.com/docker/libnetwork/issues/617.
docker version:
docker info:
uname -a:
Environment details (AWS, VirtualBox, physical, etc.):
Physical - docker swarm cluster.
How reproducible:
Rare - happened 2 times after creating/starting 1000s of containers.
Steps to Reproduce:
Actual Results:
We get a connection timeout, for example with the golang http client:
10.158.0.60 is the address of the container in step 2 in the overlay network subnet.
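The original Go snippet is not reproduced here; a roughly equivalent check can be made with curl, where client-container and port 8080 are hypothetical stand-ins for the real client container and service port:
# times out when run from a container on a different node than the new container
docker exec client-container curl -m 10 http://10.158.0.60:8080/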
The docker logs on the swarm node that launched the container in step 2 contain (from journalctl -u docker): we see a line like this for each failed request between the containers.
When we make the same request from a container in the overlay network on the same swarm node as the container running the http server the expected connection is established and a response is received.
Expected Results:
The http client receives a response from the container it's trying to connect to.
Additional info:
The second time this occurred we fixed the problem by stopping and starting the container running the http server.
We are using Consul as the KV store of the overlay network and swarm.
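For context, a sketch of the daemon flags used to back a classic-Swarm overlay network with Consul; the Consul address and advertised interface are placeholders:
# each engine points at the shared KV store and advertises itself to the cluster
dockerd --cluster-store=consul://consul.example.com:8500 \
        --cluster-advertise=eth0:2376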
When removing the container that cannot be connected to, the docker logs (journalctl -u docker) contain the line:
The docker log lines are emitted by https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_serf.go#L180. I can't find an existing issue tracking this.