umegaya opened 7 years ago
update: added traceroute output for each connection-error case.
root@93a8e0c8f6af:/# telnet 10.0.150.32 50051
Trying 10.0.150.32...
Connected to 10.0.150.32.
Escape character is '^]'.
?^CConnection closed by foreign host.
root@93a8e0c8f6af:/# traceroute 10.0.150.32
traceroute to 10.0.150.32 (10.0.150.32), 64 hops max
1 10.0.150.32 0.002ms 0.001ms 0.001ms
root@93a8e0c8f6af:/# telnet 10.0.150.17 50051
Trying 10.0.150.17...
telnet: Unable to connect to remote host: No route to host
root@93a8e0c8f6af:/# traceroute 10.0.150.17
traceroute to 10.0.150.17 (10.0.150.17), 64 hops max
1 10.0.150.15 2998.070ms !H * 0.805ms !H
root@93a8e0c8f6af:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-18-0-1.a 0.0.0.0 UG 0 0 0 eth1
10.0.150.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
root@93a8e0c8f6af:/#
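For context, the `!H` in the traceroute above is an ICMP "host unreachable" reply from 10.0.150.15, i.e. a peer on the overlay subnet could not resolve the destination's address. A hedged sketch of follow-up commands one could run inside the affected container to inspect that state (assumes iproute2 is available in the container; addresses are taken from the session above):

```shell
# Neighbor (ARP) entries for the overlay interface: a missing or FAILED entry
# for 10.0.150.17 would match the "!H" seen in the traceroute.
ip neigh show dev eth0
# Which route/interface the kernel would use for the unreachable peer.
ip route get 10.0.150.17
```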
root@b2c68c626491:/# telnet 10.0.150.32 50051
Trying 10.0.150.32...
telnet: Unable to connect to remote host: Connection timed out
root@b2c68c626491:/# traceroute 10.0.150.32
traceroute to 10.0.150.32 (10.0.150.32), 64 hops max
1 * * *
2 * * *
3 * * *
4 * * *
5 * ^C
root@b2c68c626491:/# telnet 10.0.150.30 50051
Trying 10.0.150.30...
Connected to 10.0.150.30.
Escape character is '^]'.
?^CConnection closed by foreign host.
root@b2c68c626491:/# traceroute 10.0.150.30
traceroute to 10.0.150.30 (10.0.150.30), 64 hops max
1 10.0.150.30 0.002ms 0.001ms 0.002ms
root@b2c68c626491:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-18-0-1.a 0.0.0.0 UG 0 0 0 eth1
10.0.150.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
more update: restarting the nodes one by one seems to solve this problem (re-creating the service does not). The first time, I restarted the nodes concurrently, like docker-machine start/regenerate-certs node1 node2 node3... so this issue may be related to initializing the swarm cluster.
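The one-by-one restart described above could be scripted roughly like this (node names follow the command mentioned in the comment; treat this as a sketch, not a verified procedure — in particular the `--force` flag on `regenerate-certs` is assumed to skip the confirmation prompt):

```shell
# Restart swarm nodes sequentially, regenerating certs per node, instead of
# passing all node names to docker-machine at once (which acts concurrently).
for node in node1 node2 node3; do
    docker-machine restart "$node"
    docker-machine regenerate-certs --force "$node"
done
```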
more update2: sorry, but after updating the service several times, it seems to happen again.
ping @sanimej
@umegaya Can you try the 17.06 CE version? Before looking into the details of the issue, let's confirm whether it's still seen with the latest.
EDIT: OK, I understand: 17.05 has deprecated the `docker daemon` command. I need to edit the start script, but it works; the issue does not happen now. But given the nature of this problem, it may happen again after a service update; I will report here if it does.
@sanimej I tried to upgrade the machines with docker-machine upgrade node1 node2 node3
(because a brand-new machine did not have the problem),
then got this error:
Waiting for SSH to be available...
Waiting for SSH to be available...
Waiting for SSH to be available...
Detecting the provisioner...
Detecting the provisioner...
Detecting the provisioner...
Waiting for SSH to be available...
Detecting the provisioner...
Waiting for SSH to be available...
Waiting for SSH to be available...
Detecting the provisioner...
Detecting the provisioner...
Installing Docker...
Installing Docker...
Installing Docker...
error installing docker:
error installing docker:
error installing docker:
The syslog of one of the nodes says:
Jul 20 23:51:42 mgo-prod-m systemd[1]: Starting Docker Application Container Engine...
Jul 20 23:51:42 mgo-prod-m docker[17401]: `docker daemon` is not supported on Linux. Please run `dockerd` directly
Jul 20 23:51:42 mgo-prod-m systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
The second line already appeared before upgrading docker-machine, and dockerd seemed to run in that case. Does this mean some of Docker's persistent state was already broken, or is it another problem?
> `docker daemon` is not supported on Linux. Please run `dockerd` directly

That's caused by a combination of issues: the `docker` binary in 17.06 does not have the `daemon` subcommand (it's deprecated, but should still work in 17.06; this will be fixed in 17.06.1). The second issue is that docker-machine created a systemd override file with the wrong command for the version of Docker in use. Check this file on those machines: /etc/systemd/system/docker.service.d/10-machine.conf; change `docker daemon` to `dockerd`, then run `systemctl daemon-reload` and `systemctl restart docker.service`.
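The manual fix described above could be scripted per machine roughly as follows (a sketch: the override path is the one given in the comment, and the sed pattern assumes `docker daemon` appears only in the ExecStart line of that file):

```shell
# Rewrite the docker-machine systemd override to use `dockerd` instead of the
# removed `docker daemon` subcommand, then reload units and restart the engine.
sudo sed -i 's/docker daemon/dockerd/' \
    /etc/systemd/system/docker.service.d/10-machine.conf
sudo systemctl daemon-reload
sudo systemctl restart docker.service
```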
Description: I use a 3-node swarm with 3 managers on AWS; each node was created by docker-machine (ami-87b917e4). After restarting the nodes, some containers cannot communicate with each other via IP address or service name.
Steps to reproduce the issue:
create network
create 4 backend services and 1 frontend service, which is in global mode; note that each container has at least 1 publish setting (I omit the fluent logger setting to simplify)
after restarting the nodes, try to connect to the other services via the DNS entry/VIP
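The steps above might look roughly like this. Service names, image names, and ports are placeholders, since the report omits the exact definitions; only the network name prod-nw comes from the report, and I read "global mode" as applying to the frontend:

```shell
# Overlay network plus 4 backend services and 1 global-mode frontend,
# each with at least one published port, per the steps above.
docker network create --driver overlay prod-nw
for i in 1 2 3 4; do
    docker service create --name "backend$i" --network prod-nw \
        --publish "808$i:8080" example/backend
done
docker service create --name frontend --mode global --network prod-nw \
    --publish 80:80 example/frontend
```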
Describe the results you received:
each container had the following IPs on prod-nw:
most of the connectivity worked well, except:
and if connectivity was lost, even with a direct IP, I got the following errors:
Describe the results you expected: I expected to be able to connect to the service using the VIP created for it, with traffic routed accordingly.
Additional information you deem important (e.g. issue happens only occasionally): it's similar to #26106, but with a few differences, so I was advised to create it as a new issue:
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.): AWS, 3-node swarm, 3 managers, each node created by docker-machine (ami-87b917e4)