Cannot connect to a container in an overlay network from a different swarm node: `could not resolve peer \"<nil>\": timed out resolving peer by querying the cluster`

paulcadman commented 8 years ago

Description of problem:

Very rarely (observed twice after using 1000s of containers) we start a new container into an overlay network in a docker swarm. Existing containers in the overlay network that are on different nodes cannot connect to the new container. However containers in the overlay network on the same node as the new container are able to connect.

The new container receives an IP address in the overlay network subnet, but this does not seem to work correctly when resolved from a different node.

The second time this happened we fixed the problem by stopping and starting the new container.

We haven't found a way to reliably reproduce this problem. Is there any other debugging I can provide that would help diagnose this issue?

The error message is the same as the one reported on https://github.com/docker/libnetwork/issues/617.

docker version:

Client:
 Version:      1.10.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   590d5108
 Built:        Thu Feb  4 19:04:33 2016
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.1.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   a0fd82b
 Built:        Thu Feb  4 08:55:18 UTC 2016
 OS/Arch:      linux/amd64

docker info:

Containers: 102
 Running: 53
 Paused: 0
 Stopped: 49
Images: 372
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 glera.int.corefiling.com: 10.0.0.57:2375
  └ Status: Healthy
  └ Containers: 32
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 32.94 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:16Z
 kafue.int.corefiling.com: 10.0.0.17:2375
  └ Status: Healthy
  └ Containers: 36
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 16.4 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:20Z
 paar.int.corefiling.com: 10.0.1.1:2375
  └ Status: Healthy
  └ Containers: 34
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 16.44 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:31Z
Plugins:
 Volume:
 Network:
Kernel Version: 4.2.6-201.fc22.x86_64
Operating System: linux
Architecture: amd64
CPUs: 12
Total Memory: 65.77 GiB
Name: 9dd94ffb6aea

uname -a:

Linux glera.int.corefiling.com 4.2.6-201.fc22.x86_64 #1 SMP Tue Nov 24 18:42:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Environment details (AWS, VirtualBox, physical, etc.):

Physical - docker swarm cluster.

How reproducible:

Rare - happened 2 times after creating/starting 1000s of containers.

Steps to Reproduce:

Create/start a container in an overlay network
In the same overlay network create/start a container on a different host in the swarm. A process in the container is listening on port 80 and this port is exposed to the overlay network.
Try to connect to the container of step 2 from within the container of step 1 with http client.
Actual Results:

Get a connection timeout. For example with the golang http client:

 http: proxy error: dial tcp 10.158.0.60:80: i/o timeout

10.158.0.60 is the address of the container in step 2 in the overlay network subnet.

The docker logs on the swarm node that launched the container in step 2 contain (from journalctl -u docker):

level=error msg="could not resolve peer \"<nil>\": timed out resolving peer by querying the cluster".

We see a line like this for each failed request between the containers.

When we make the same request from a container in the overlay network on the same swarm node as the container running the http server the expected connection is established and a response is received.

Expected Results:

The http client receieves a response from the container its trying to connect to.

Additional info:

The second time this occurred we fixed the problem by stopping and starting the container running the http server.

We are using Consul as the KV store of the overlay network and swarm.

When removing the container that cannot be connected to, docker logs (journalctl -u docker) contain the line:

error msg="Peer delete failed in the driver: could not delete fdb entry into the sandbox: could not delete neighbor entry: no such file or directory\n"

The docker log lines are emitted by https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_serf.go#L180. I can't find an existing issue tracking this.

byrnedo commented 8 years ago

I'm seeing the same thing: docker/docker#22144

jsumali commented 8 years ago

Hit this issue as well, restarting the container seems to fix it, but isn't ideal.

We're using:

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 19:36:04 2016
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.2.3
 API version:  1.22
 Go version:   go1.5.4
 Git commit:   eaa53c7
 Built:        Fri May 27 17:25:03 UTC 2016
 OS/Arch:      linux/amd64

longkt90 commented 8 years ago

I have a same problem. Just comment to see if there is any update on this.

My env:

Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 22:00:43 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 22:00:43 2016
 OS/Arch:      linux/amd64

dvenza commented 8 years ago

Another me too. I can add that for me it is sufficient to ping from the container that cannot do the connect any other container on the same network to fix the issue, the first ping may take slightly longer, but then everything starts to work as intended. I am using Zookeeper.

Some weirdness with ARP? Perhaps relates to this: https://github.com/docker/libnetwork/issues/338 ?

dvenza commented 8 years ago

No, ARP seems to be fine, at the container level. I have in front of me a "blocked" container and arp -a reports the correct mac address for the server container it is trying to connect to. So there is something wrong with IP.

I can reproduce it quite often by running Spark clusters. I would say that one cluster out of five comes out bad with some worker unable to talk to the master. Since we do a lot of automated tests and performance measurements, we get many outliers and basically makes Swarm with overlay networks too unreliable.

dvenza commented 8 years ago

There is something fishy in the overlay network namespace. Here is the vxlan fdb on the host on which the "blocked" container is running:

$ sudo nsenter --net=/var/run/docker/netns/1-07e76e17d6 bridge fdb show dev vxlan1
6e:ed:5e:b7:60:b1 vlan 1 permanent
6e:ed:5e:b7:60:b1 permanent
02:42:0a:00:04:11
33:33:00:00:00:01 self permanent
01:00:5e:00:00:01 self permanent
33:33:ff:b7:60:b1 self permanent
02:42:0a:00:04:10 dst 192.168.47.12 self permanent
02:42:0a:00:04:03 dst 192.168.47.19 self permanent
02:42:0a:00:04:0f dst 192.168.47.12 self permanent
02:42:0a:00:04:0e dst 192.168.47.20 self permanent
02:42:0a:00:04:0d dst 192.168.47.20 self permanent
02:42:0a:00:04:0c dst 192.168.47.13 self permanent
02:42:0a:00:04:0b dst 192.168.47.17 self permanent
02:42:0a:00:04:09 dst 192.168.47.15 self permanent
02:42:0a:00:04:08 dst 192.168.47.16 self permanent
02:42:0a:00:04:07 dst 192.168.47.16 self permanent
02:42:0a:00:04:13 dst 192.168.47.21 self permanent
02:42:0a:00:04:06 dst 192.168.47.21 self permanent
02:42:0a:00:04:12 dst 192.168.47.18 self permanent
02:42:0a:00:04:05 dst 192.168.47.18 self permanent

02:42:0a:00:04:11 is the MAC address of the container with which the "blocked" container cannot talk. The other MAC addresses have an IP address, but that one does not.

Any idea, @mavenugo ?

mavenugo commented 8 years ago

Thanks @dvenza. i will page @mrjana to see if he has any ideas.

adamlc commented 8 years ago

I'm also getting this issue too. Unfortunately I know virtually nothing about networking. My setup is pretty much identical to @paulcadman.

I've actually had the issue for some time over the last few docker releases. It got so bad at one point I recreated the whole overlay network and containers and its been good for a few weeks, but its happening again :(

If there's anything I can debug let me know. We're running on EC2 is that helps.

Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Wed Apr 27 00:34:42 2016
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.2.0
 API version:  1.22
 Go version:   go1.5.4
 Git commit:   a6c1f14
 Built:        Wed Apr 13 05:58:31 UTC 2016
 OS/Arch:      linux/amd64

Containers: 33
 Running: 26
 Paused: 0
 Stopped: 7
Images: 45
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 swarm-node-eu-west1-1a-2: 10.0.1.193:2376
  └ Status: Healthy
  └ Containers: 10
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 1.322 GiB / 3.624 GiB
  └ Labels: executiondriver=, kernelversion=3.10.0-327.13.1.el7.x86_64, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
  └ Error: (none)
  └ UpdatedAt: 2016-08-11T08:13:38Z
  └ ServerVersion: 1.11.1
 swarm-node-eu-west1-1b-2: 10.0.0.150:2376
  └ Status: Healthy
  └ Containers: 12
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 1.322 GiB / 3.624 GiB
  └ Labels: executiondriver=, kernelversion=3.10.0-327.10.1.el7.x86_64, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
  └ Error: (none)
  └ UpdatedAt: 2016-08-11T08:13:30Z
  └ ServerVersion: 1.11.1
 swarm-node-eu-west1-1c-2: 10.0.2.227:2376
  └ Status: Healthy
  └ Containers: 11
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 330 MiB / 3.624 GiB
  └ Labels: executiondriver=, kernelversion=3.10.0-327.10.1.el7.x86_64, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
  └ Error: (none)
  └ UpdatedAt: 2016-08-11T08:13:43Z
  └ ServerVersion: 1.11.1
Plugins:
 Volume:
 Network:
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: linux
Architecture: amd64
CPUs: 6
Total Memory: 10.87 GiB
Name: fcf67f74802a
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support

Linux support-eu-west1-1a-1 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

dvenza commented 8 years ago

On a new deployment with 25 machines this problem blocks network communication for about half of the containers that I create, making it completely worthless.

Any news?

lamhaison commented 8 years ago

I have face to with same issue. I think overlay network is unreliable in ubuntu.

Server Version: swarm/1.2.3

Client: Version: 1.11.2 API version: 1.23 Go version: go1.5.4 Git commit: b9f10c9 Built: Wed Jun 1 21:47:50 2016 OS/Arch: linux/amd64

Server: Version: 1.11.2 API version: 1.23 Go version: go1.5.4 Git commit: b9f10c9 Built: Wed Jun 1 21:47:50 2016 OS/Arch: linux/amd64

hdanniel commented 7 years ago

We have the same issue in a docker swarm with thousands of containers. In all the docker nodes (1.10.3) we have an nginx containers which can communicate with different application containers in the overlay network (using consul). Sometimes one of the nginx containers cannot connect to an app container in a different node, receiving the same message in the log:

ERRO[5758943] could not resolve peer "<nil>": timed out resolving peer by querying the cluster

Additionally we are seeing that the app containers which fail always have the same IP addresses. Restarting the app container doesn't work if the container received the same IP address. What work for us is:

Stop docker (where the nginx container is running), clean iptables, start docker
Clean the network namespaces in the docker node where the nginx container is running. This work but not inmediatly
After some hours the IP address start to answer without any apparent reason.

We tried to debug where the packets are going without success. There is not activity in the remote node so we think the packet is not leaving the node where the nginx container is running.

antoinetran commented 7 years ago

We have the exact same issue, with a different environment:

Swarm (classic image mode 1.2.6)
docker-engine CE 17.03.1 Have to reboot machines to have a working Swarm overlay network again.

tfasz commented 7 years ago

As mentioned above, we have been working around this issue by periodically running ping from the container that can not be reached back in the opposite direction. In our case a Nginx container often can't connect to a backend container to proxy HTTP requests. All of these containers are on the same frontend network. We setup a Jenkins job that runs every 5 minutes (could be a cron job too) that finds all of the containers on the frontend network and execs into them to ping the nginx container:

export DOCKER_HOST=swarm-manager:3375
CONTAINERS=$(docker network inspect frontend | grep "\"Name\":" | grep -v "\"frontend\"" | grep -v "\"nginx\"" | sort | cut -d "\"" -f 4)
for CONTAINER in $CONTAINERS; do 
  docker exec -i ${CONTAINER} ping nginx -c 1
done

This seems to keep the VXLAN working (and/or resolve the issue when it does happen) without having to recreate containers or restart anything.

antoinetran commented 7 years ago

While we indeed managed to get network working again by doing this (pinging in the opposite direction) before, we also encounter a case where we cannot avoid rebooting the machine. Docker Swarm 1.2.6 with docker-ce-17-03-.1 is so not production-ready (eg 5 machines KO in a cluster of 8).

fcrisciani commented 7 years ago

@tfasz @antoinetran this behavior got fixed by https://github.com/docker/libnetwork/pull/1792 will be available on 17.06 and also will be backported to 17.03

antoinetran commented 7 years ago

Dear all,

We migrated 3 Swarm clusters to docker-ce-17.06.0, 2 weeks ago, and this seems to work fine, until now. I just reproduced this error once. I had to ping back to have the connectivity again. But it seems this error is more rare now.

Any info/log someone want?

sanimej commented 7 years ago

@antoinetran To make sure we understand what issue is clearly, can you provide the details on exactly what the connectivity issue is and what the trigger is ?

antoinetran commented 7 years ago

Environment: In a Swarm (classic mode 1.2.6 with docker-ce-17.06.0-ce CentOs 7.3.1611), we have multiple Swarm node and multiple network overlay. Containers are attached to one or more of these networks, and launched with Docker-compose in Swarm.

Connectivity issue: In a container C1 in VM, I cannot reach (ping/nc) a container C2 in another VM, with its service name. The ping freezes. If, in another window, I ping from container C2 to C1, immediately after, the ping C1->C2 works.

What the trigger is I have no idea. We are in an integration/validation process. Each 20 min, we do a docker-compose down of business containers, we change the version, then we do docker-compose up again. The only abnormal symptom I can see is that one of our business container is slower to start.

antoinetran commented 6 years ago

@fcrisciani I reproduced this issue a small number of time. Here is my new environment: all latest CentOs 7.4.1708 / docker-ce-17.12.1 / Swarm image 1.2.8.

Same symptom as my last post.

fcrisciani commented 6 years ago

@antoinetran so the weird part is that the ping to the service name is answered locally, there is an iptables rule that reply to the ping to the VIPs directly from the loopback. Can it be only that C1 is really slow in coming up?

antoinetran commented 6 years ago

Can it be only that C1 is really slow in coming up?

I don't know what you mean by slow but if the ping does not work, I wait maybe a few seconds to be sure.

so the weird part is that the ping to the service name is answered locally, there is an iptables rule that reply to the ping to the VIPs directly from the loopback.

To be more precise, the DNS resolution always work. It is really the ping that does not work (no pong).

fcrisciani commented 6 years ago

@antoinetran I was thinking that C1 was still not ready, but if you can exec into it should not be the case. Do you see anything unusual in the daemon logs or in the kernel logs?

antoinetran commented 6 years ago

Hard to say, kernel/daemon logs are the first thing I look. There are recurrent errors that does not seems to be errors:

Mar 5 03:09:24 dev-node-003 dockerd: time="2018-03-05T03:09:24.046536158Z" level=error msg="Failed to deserialize netlink ndmsg: Link not found"

Mar 08 18:54:38 dev-node-003.ts-l2pf.cloud-omc.org dockerd[1126]: time="2018-03-08T18:54:38.295062659Z" level=error msg="could not resolve peer \"192.168.101.49\": timed out resolving peer by Mar 09 17:22:29 dev-node-006.ts-l2pf.cloud-omc.org dockerd[1310]: time="2018-03-09T17:22:29.442630899Z" level=error msg="2018/03/09 17:22:29 [ERR] memberlist: Failed fallback ping: read tcp 10.0.1.26:50300->10.0.1.29:7946: i/o timeout\n 716494Z" level=info msg="2018/03/09 17:22:29 [INFO] memberlist: Suspect dev-node-009.ts-l2pf.cloud-omc.org has failed, no acks received\n"

I will try to archive these logs when the event happen. Right now I forgot when it happened.

fcrisciani commented 6 years ago

@antoinetran I see memberlist complaining about one node and marking it as suspect, this means that the physical network is failing delivering the health checks. if the networkdb is not able to communicate with the other nodes, that will explain why the ping fails, because the configuration is not being propagated so c1 is not aware of c2.

antoinetran commented 6 years ago

@fcrisciani Ok! Thank you for your diagnostic. This event is probably due to an IP collision we had today (from docker default network when we do compose up). That explains a lot. It might also be a cause of network lost during my other post in old environments.

fcrisciani commented 6 years ago

@antoinetran in 17.12, I added several logs that you can use to check the status of the networkDB when you see weird behavior. First is: NetworkDB stats, it will print stats per network like:

Feb 23 05:38:06 ip-172-31-19-13 dockerd[8276]: time="2018-02-23T05:38:06.269354743Z" level=info msg="NetworkDB stats ip-172-31-19-13(566e4c57c00e) - netID:3r6rqkvee3l4c7dx3c9fmf2a8 leaving:false netPeers:3 entries:12 Queue qLen:0 netMsg/s:0"

These are printed every 5 min and is 1 line per network. This tells you that on the specific network there is 3 nodes that have container deployed on it. Also if there is connectivity issue you will see a line mentioning healthscore, this is a number coming directly from memberlist and the higher is the value the more severe is the problem. If you see healthscore mentioned then it's a good indicator that the underlay network can be acting funky so is good to start from the bottom the debug.

mhemrg commented 5 years ago

Same issue with 17.12.1

jerradpatch commented 4 years ago

make sure your firewall is open for port needed for overlay networks https://docs.docker.com/network/overlay/#publish-ports-on-an-overlay-network

TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic

moby / libnetwork