umegaya opened 7 years ago
update: added traceroute output for each connection-error case.
root@93a8e0c8f6af:/# telnet 10.0.150.32 50051
Trying 10.0.150.32...
Connected to 10.0.150.32.
Escape character is '^]'.
?^CConnection closed by foreign host.
root@93a8e0c8f6af:/# traceroute 10.0.150.32
traceroute to 10.0.150.32 (10.0.150.32), 64 hops max
1 10.0.150.32 0.002ms 0.001ms 0.001ms
root@93a8e0c8f6af:/# telnet 10.0.150.17 50051
Trying 10.0.150.17...
telnet: Unable to connect to remote host: No route to host
root@93a8e0c8f6af:/# traceroute 10.0.150.17
traceroute to 10.0.150.17 (10.0.150.17), 64 hops max
1 10.0.150.15 2998.070ms !H * 0.805ms !H
root@93a8e0c8f6af:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-18-0-1.a 0.0.0.0 UG 0 0 0 eth1
10.0.150.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
root@93a8e0c8f6af:/#
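For context, the `!H` in the traceroute above is an ICMP "host unreachable" reply from 10.0.150.15, i.e. a peer on the overlay subnet could not resolve the destination's address. A hedged sketch of follow-up commands one could run inside the affected container to inspect that state (assumes iproute2 is available in the container; addresses are taken from the session above):

```shell
# Neighbor (ARP) entries for the overlay interface: a missing or FAILED entry
# for 10.0.150.17 would match the "!H" seen in the traceroute.
ip neigh show dev eth0
# Which route/interface the kernel would use for the unreachable peer.
ip route get 10.0.150.17
```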
root@b2c68c626491:/# telnet 10.0.150.32 50051
Trying 10.0.150.32...
telnet: Unable to connect to remote host: Connection timed out
root@b2c68c626491:/# traceroute 10.0.150.32
traceroute to 10.0.150.32 (10.0.150.32), 64 hops max
1 * * *
2 * * *
3 * * *
4 * * *
5 * ^C
root@b2c68c626491:/# telnet 10.0.150.30 50051
Trying 10.0.150.30...
Connected to 10.0.150.30.
Escape character is '^]'.
?^CConnection closed by foreign host.
root@b2c68c626491:/# traceroute 10.0.150.30
traceroute to 10.0.150.30 (10.0.150.30), 64 hops max
1 10.0.150.30 0.002ms 0.001ms 0.002ms
root@b2c68c626491:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default ip-172-18-0-1.a 0.0.0.0 UG 0 0 0 eth1
10.0.150.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth1
more update: restarting the nodes one by one seems to solve this problem (re-creating the service does not). The first time, I restarted the nodes concurrently, like docker-machine start/regenerate-certs node1 node2 node3... so this issue may be related to initializing the swarm cluster.
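The one-by-one restart described above could be scripted roughly like this (node names follow the command mentioned in the comment; treat this as a sketch, not a verified procedure — in particular the `--force` flag on `regenerate-certs` is assumed to skip the confirmation prompt):

```shell
# Restart swarm nodes sequentially, regenerating certs per node, instead of
# passing all node names to docker-machine at once (which acts concurrently).
for node in node1 node2 node3; do
    docker-machine restart "$node"
    docker-machine regenerate-certs --force "$node"
done
```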
more update2: sorry, but after updating the service several times, it seems to happen again.
ping @sanimej
@umegaya Can you try the 17.06 CE version? Before looking into the details of the issue, let's confirm whether it's still seen with the latest.
EDIT: OK, I understand: 17.05 has deprecated the `docker daemon` command. I need to edit the start script, but it works; the issue does not happen now. But given the nature of this problem, it may happen again after a service update; I will report here if it does.
@sanimej I tried to upgrade the machines with docker-machine upgrade node1 node2 node3
(because a brand-new machine did not have the problem),
then got this error:
Waiting for SSH to be available...
Waiting for SSH to be available...
Waiting for SSH to be available...
Detecting the provisioner...
Detecting the provisioner...
Detecting the provisioner...
Waiting for SSH to be available...
Detecting the provisioner...
Waiting for SSH to be available...
Waiting for SSH to be available...
Detecting the provisioner...
Detecting the provisioner...
Installing Docker...
Installing Docker...
Installing Docker...
error installing docker:
error installing docker:
error installing docker:
The syslog of one of the nodes says:
Jul 20 23:51:42 mgo-prod-m systemd[1]: Starting Docker Application Container Engine...
Jul 20 23:51:42 mgo-prod-m docker[17401]: `docker daemon` is not supported on Linux. Please run `dockerd` directly
Jul 20 23:51:42 mgo-prod-m systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
The second line already appeared before upgrading docker-machine, and dockerd seemed to run in that case. Does this mean some of Docker's persistent state was already broken, or is it another problem?
> `docker daemon` is not supported on Linux. Please run `dockerd` directly

That's caused by a combination of issues: the `docker` binary in 17.06 does not have the `daemon` subcommand (it's deprecated, but should still work in 17.06; this will be fixed in 17.06.1). The second issue is that docker-machine created a systemd override file with the wrong command for the version of Docker in use. Check this file on those machines: /etc/systemd/system/docker.service.d/10-machine.conf; change `docker daemon` to `dockerd`, then run `systemctl daemon-reload` and `systemctl restart docker.service`.
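The manual fix described above could be scripted per machine roughly as follows (a sketch: the override path is the one given in the comment, and the sed pattern assumes `docker daemon` appears only in the ExecStart line of that file):

```shell
# Rewrite the docker-machine systemd override to use `dockerd` instead of the
# removed `docker daemon` subcommand, then reload units and restart the engine.
sudo sed -i 's/docker daemon/dockerd/' \
    /etc/systemd/system/docker.service.d/10-machine.conf
sudo systemctl daemon-reload
sudo systemctl restart docker.service
```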
Description: I use a 3-node swarm with 3 managers on AWS; each node was created by docker-machine (ami-87b917e4). After restarting the nodes, some containers cannot communicate with each other via IP address or service name.
Steps to reproduce the issue:
create network
create 4 backend services and 1 frontend service, which is in global mode; note that each container has at least 1 publish setting (I omit the fluent logger setting to simplify)
after restarting the nodes, try to connect to the other services via the DNS entry/VIP
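The steps above might look roughly like this. Service names, image names, and ports are placeholders, since the report omits the exact definitions; only the network name prod-nw comes from the report, and I read "global mode" as applying to the frontend:

```shell
# Overlay network plus 4 backend services and 1 global-mode frontend,
# each with at least one published port, per the steps above.
docker network create --driver overlay prod-nw
for i in 1 2 3 4; do
    docker service create --name "backend$i" --network prod-nw \
        --publish "808$i:8080" example/backend
done
docker service create --name frontend --mode global --network prod-nw \
    --publish 80:80 example/frontend
```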
Describe the results you received:
each container had the following IPs on prod-nw:
most of the connectivity worked well, except:
and if connectivity was lost, even with a direct IP, I got the following errors:
Describe the results you expected: I expected to be able to connect to the service using the VIP created for it, with traffic routed accordingly.
Additional information you deem important (e.g. issue happens only occasionally): it's similar to #26106, but with a few differences, so I was advised to create it as a new issue:
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.): AWS, 3-node swarm, 3 managers, each node created by docker-machine (ami-87b917e4)