Closed: mschirrmeister closed this issue 8 years ago.
I am a little bit baffled at the moment. If you have any tips on where I can look to provide more insight, please let me know and I will provide it. Right now it looks to me like some internal load balancing/IPVS machinery is choking.
I met the same problem with a 3-node setup. I brought up a service with 5 replicas using following command:
docker service create --name helloworld --replicas 5 --publish 8888:80 dockercloud/hello-world
When I curl one-node:8888, it chokes sometimes. Since the dockercloud/hello-world image returns the container ID, I compared all container IDs and found that one container was never reached. I then killed that container, swarm brought up a new one, and curl no longer got stuck.
My three nodes are located at AWS Tokyo, Vultr Tokyo and DigitalOcean SGP1.
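For anyone who wants to repeat that per-container check, a minimal sketch (one-node:8888 is a placeholder for your own published endpoint, and the grep assumes the hello-world page includes the container hostname):

for i in $(seq 20); do
  # hit the published port repeatedly and keep only the hostname line
  curl -s --max-time 2 one-node:8888 | grep -io 'hostname[^<]*'
done | sort | uniq -c   # count how often each container actually answered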
I think this may be a duplicate of https://github.com/docker/docker/issues/25219 or https://github.com/docker/docker/issues/25130. Could you have a look at those?
@mschirrmeister Is the problem still there if you start a service with 3 replicas, instead of starting the service with one replica and then scaling up?
@mrjana Yes, the problem is still there, even if I start the service with the option --replicas 3.
While creating/deleting and querying services today, I watched the syslog on the hosts for errors and saw the following. Not sure how bad that is, or whether it is helpful.
querying a service with curl
Aug 3 08:33:52 azeausdockerapps301t dockerd: time="2016-08-03T08:33:52.810065000Z" level=error msg="could not resolve peer \"10.255.0.3\": could not resolve peer: serf instance not initialized"
adding a service
Aug 3 08:25:57 azeausdockerapps302t dockerd: time="2016-08-03T08:25:57.731040585Z" level=error msg="Failed to create real server 10.255.0.11 for vip 10.255.0.10 fwmark 289 in sb d554fb6136ecda3acba06c2b936d235e17cc1273c21d655e1d2d13448fec2825: no such process"
deleting a service
Aug 3 08:36:35 azeausdockerapps303t dockerd: time="2016-08-03T08:36:35.122386868Z" level=info msg="Failed to delete real server 10.255.0.12 for vip 10.255.0.10 fwmark 294: no such file or directory"
I wasn't able to reproduce it using Boot2Docker version 1.12.0 VMs. So it does seem that the issue only happens occasionally.
rogaha@Robertos-MacBook-Pro:~$ docker network create --driver overlay whoami-net
cke02aohtbpspx5so5gc6a76x
rogaha@Robertos-MacBook-Pro:~$ docker network ls | grep whoami-net
cke02aohtbps whoami-net overlay swarm
rogaha@Robertos-MacBook-Pro:~$ docker service create --name service1 --network whoami-net -p 8000 jwilder/whoami
cojb16cncgj76z0sslxt015bc
rogaha@Robertos-MacBook-Pro:~$ docker service scale service1=3
service1 scaled to 3
rogaha@Robertos-MacBook-Pro:~$ docker-machine ssh node3
Boot2Docker version 1.12.0, build HEAD : e030bab - Fri Jul 29 00:29:14 UTC 2016
Docker version 1.12.0, build 8eab29e
docker@node3:~$ time curl http://192.168.99.100:30000; time curl http://192.168.99.102:30000; time curl http://192.168.99.104:30000
I'm a2531712ae05
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
I'm b0d6d239f694
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
I'm af339bc5e1bf
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
docker@node3:~$ exit
rogaha@Robertos-MacBook-Pro:~$ docker-machine ssh master2
Boot2Docker version 1.12.0, build HEAD : e030bab - Fri Jul 29 00:29:14 UTC 2016
Docker version 1.12.0, build 8eab29e
docker@master2:~$ time curl http://192.168.99.100:30000; time curl http://192.168.99.102:30000; time curl http://192.168.99.104:30000
I'm a2531712ae05
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
I'm b0d6d239f694
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
I'm af339bc5e1bf
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
docker@master2:~$ exit
rogaha@Robertos-MacBook-Pro:~$ docker-machine ssh master1
Boot2Docker version 1.12.0, build HEAD : e030bab - Fri Jul 29 00:29:14 UTC 2016
Docker version 1.12.0, build 8eab29e
docker@master1:~$ time curl http://192.168.99.100:30000; time curl http://192.168.99.102:30000; time curl http://192.168.99.104:30000
I'm a2531712ae05
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
I'm b0d6d239f694
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
I'm af339bc5e1bf
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
docker@master1:~$ docker service ps service1
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
c83vrhg87a0hxpfo0te4weat6 service1.1 jwilder/whoami master1 Running Running 17 minutes ago
1vmmzhup16d5t5ounw0z06qqv service1.2 jwilder/whoami master2 Running Running 16 minutes ago
5uhwu39j5wjslsm9gz5b4242n service1.3 jwilder/whoami node3 Running Running 16 minutes ago
docker@master1:~$
Hi all,
I too have some issues with the load balancer suddenly stopping working. My setup is a single server (CentOS) with Docker 1.12 installed. After a while, the following simple play-around actions caused it to stop working:
docker swarm init
docker service create --name web --publish 80:80 --replicas 2 nginxdemos/hello
...
docker service scale web=0
docker service scale web=20
docker service scale web=0
docker service scale web=5
docker service scale web=10
docker service scale web=15
docker stop 2a66b345100c
docker rm 2a66b345100c
docker stop 7c79846edf41 085ee0c5596f 165d88d0029b c68f1202d8bc ab78b5649915 debb3f7f5673 76347454844e
I tested it from 2 external servers with curl (watch -n1 "curl -s 10.3.x.x | grep -E 'My hostname|My address'"). The first symptom was that the load balancer stopped round-robining across the containers; each curl stayed on the same container, on all curl tests from all servers, including curl on the server itself. Then the load balancer stopped altogether, with every curl test resulting in a timeout.
Syslog from the time it stopped (plus docker info output): https://gist.github.com/erikrs/7c05940f9c1e98c15a41f367686aa517
Some time after this, I scaled the swarm service again to 0, and again to 2. It then worked again.
manager0:~$ docker swarm init
Swarm initialized: current node (8bigilkl82ilyhxroro2rigm5) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-1896eccg5umy8f8uyq60nb4qanp93u56j99uyejpfhqybr1lya-bhl6g7n5ket0zpo4ml6apax5q \
10.99.10.176:2377
To add a manager to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-1896eccg5umy8f8uyq60nb4qanp93u56j99uyejpfhqybr1lya-2cxbei2xxdrzb87i4lxxi3rhd \
10.99.10.176:2377
ninja@manager0:~$ docker node update --availability drain `hostname`
manager0
ninja@node0:~$ docker swarm join \
> --token SWMTKN-1-1896eccg5umy8f8uyq60nb4qanp93u56j99uyejpfhqybr1lya-bhl6g7n5ket0zpo4ml6apax5q \
> 10.99.10.176:2377
This node joined a swarm as a worker.
ninja@node1:~$ docker swarm join \
> --token SWMTKN-1-1896eccg5umy8f8uyq60nb4qanp93u56j99uyejpfhqybr1lya-bhl6g7n5ket0zpo4ml6apax5q \
> 10.99.10.176:2377
This node joined a swarm as a worker.
ninja@manager0:~$ docker service create --name frontend --replicas 2 -p 80:8000/tcp jwilder/whoami
e1oezurvdk9fcvw594xnwn25b
ninja@manager0:~$ docker service ps frontend
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
1vdzpmsf5dvz74675cyz1mac5 frontend.1 jwilder/whoami node0 Running Starting less than a second ago
2lan6ycd2im5p5pl0oqq4vfoi frontend.2 jwilder/whoami node1 Running Starting less than a second ago
fitz123@fitz123-laptop:~$ for i in `seq 4`; do curl node0; done
I'm 64fd8496e5f7
I'm ecb291129b53
I'm 64fd8496e5f7
I'm ecb291129b53
fitz123@fitz123-laptop:~$ for i in `seq 4`; do curl node1; done
I'm ecb291129b53
I'm 64fd8496e5f7
I'm ecb291129b53
I'm 64fd8496e5f7
ninja@node0:~$ sudo reboot
ninja@manager0:~$ docker service ps frontend
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
bvsmq3pa5l1ifhctuset8dh5s frontend.1 jwilder/whoami node1 Running Running 16 seconds ago
1vdzpmsf5dvz74675cyz1mac5 \_ frontend.1 jwilder/whoami node0 Shutdown Complete 18 seconds ago
2lan6ycd2im5p5pl0oqq4vfoi frontend.2 jwilder/whoami node1 Running Running about a minute ago
ninja@manager0:~$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
6ytfk6umjiwxzk4su8qkyfcbn node0 Ready Active
8bigilkl82ilyhxroro2rigm5 * manager0 Ready Drain Leader
bl2a1ur0i9fhi3l7giia2fl2j node1 Ready Active
Result after node restart:
fitz123@fitz123-laptop:~$ for i in `seq 4`; do curl node1; done
I'm 0425a0bf58f3
I'm 64fd8496e5f7
I'm 0425a0bf58f3
I'm 64fd8496e5f7
fitz123@fitz123-laptop:~$ for i in `seq 4`; do curl --connect-timeout 2 node0; done
curl: (28) Connection timed out after 2000 milliseconds
curl: (28) Connection timed out after 2000 milliseconds
curl: (28) Connection timed out after 2001 milliseconds
curl: (28) Connection timed out after 2001 milliseconds
If I add a 3rd container:
ninja@manager0:~$ docker service scale frontend=3
frontend scaled to 3
ninja@manager0:~$ docker service ps frontend
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
bvsmq3pa5l1ifhctuset8dh5s frontend.1 jwilder/whoami node1 Running Running about an hour ago
1vdzpmsf5dvz74675cyz1mac5 \_ frontend.1 jwilder/whoami node0 Shutdown Complete about an hour ago
2lan6ycd2im5p5pl0oqq4vfoi frontend.2 jwilder/whoami node1 Running Running about an hour ago
9k12yf0gcug7amd7564ookkfj frontend.3 jwilder/whoami node0 Running Preparing 2 seconds ago
The result after adding the 3rd container looks like this:
fitz123@fitz123-laptop:~$ for i in `seq 4`; do curl node1; done
I'm 3fcc3b2b0e41
I'm 0425a0bf58f3
I'm 64fd8496e5f7
I'm 3fcc3b2b0e41
fitz123@fitz123-laptop:~$ for i in `seq 4`; do curl --connect-timeout 2 node0; done
curl: (28) Connection timed out after 2000 milliseconds
I'm 3fcc3b2b0e41
curl: (28) Connection timed out after 2001 milliseconds
curl: (28) Connection timed out after 2001 milliseconds
@mrjana @mavenugo is this issue resolved by https://github.com/docker/docker/pull/25603 ?
@thaJeztah this is one of the issues that is potentially solved via #25603. @mschirrmeister can you please confirm?
I have the same issue! I tested with CentOS and Ubuntu nodes; same issue. Usually the issue occurs only on the node that was restarted. Running "systemctl restart docker" after the reboot apparently resolves it for a moment, but the issue returns after a few minutes.
I updated to 1.12.1-rc1. The package upgrade also restarted the service. The timeout issue when accessing the service via HTTP was gone, but not all 3 backends answered when accessing via the Docker host IP addresses. On one host it always went to the same backend; on another host it was load balancing between 2 backends.
I then did a full restart of the hosts and re-created the services. Access via curl works at the moment, and all 3 backends responded on all 3 hosts. I will monitor the situation a little more to see if it breaks again.
When it was not working after the upgrade, I connected to a container on the docker host where the load balancing always went to the same backend, did a DNS lookup of tasks.service, and it only showed 1 IP address.
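For reference, that lookup can be done roughly like this (a sketch; the container ID and service name are placeholders, and the container image needs nslookup or dig available):

docker exec -it <container-id> nslookup tasks.<service-name>
# with the load balancer healthy, this should return one A record per running task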
@mschirrmeister I have seen that same symptom on several occasions, and I traced it to the IPVS table not being populated correctly (why, I don't know). To confirm the issue, cat /proc/net/ip_vs in the ingress-sbox namespace (that's the network namespace doing the load balancing for requests coming from the outside), i.e.:
cd /var/run/docker/netns/ ; nsenter --net=5683f2b6e546 cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 000001C2 rr
-> 0AFF0009:0000 Masq 1 0 0
-> 0AFF0008:0000 Masq 1 0 0
-> 0AFF0007:0000 Masq 1 0 0
Those last 3 lines are the hex-encoded IPs of the containers being load balanced for the service. On several occasions that list of target containers was either incomplete or outdated.
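As a quick sketch of how to read those hex addresses (the value below is the first backend from the output above):

hex=0AFF0009
printf '%d.%d.%d.%d\n' 0x${hex:0:2} 0x${hex:2:2} 0x${hex:4:2} 0x${hex:6:2}   # prints 10.255.0.9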
Also, if you have multiple services defined already, you'll have several of these entries. This one was for the service marked with FWM 0x01C2. You can find what traffic it was originally for with:
cd /var/run/docker/netns/ ; nsenter --net=5683f2b6e546 iptables -t mangle -L -n -v
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
0 0 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 MARK set 0x1c2
So this service is the one with port 8080 published to the outside world
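If you want to dump that mapping for every namespace at once, a small sketch like the following works (assuming the /var/run/docker/netns layout shown above; run it as root):

cd /var/run/docker/netns
for ns in *; do
  echo "== namespace $ns =="
  # which published ports get which fwmark in this namespace
  nsenter --net="$ns" iptables -t mangle -L PREROUTING -n 2>/dev/null | grep MARK
  # which backends IPVS actually has for those fwmarks
  nsenter --net="$ns" cat /proc/net/ip_vs 2>/dev/null
done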
@somejfn This is exactly the problem that was fixed in 1.12.1-rc1, i.e. incorrect backend information in IPVS. Are you using 1.12.1-rc1?
@mrjana I was on 1.12.0 with the pre-built binaries from https://get.docker.com/builds/Linux/x86_64/docker-latest.tgz. I'm on CoreOS (hence no package manager), so I guess I'd need to build from source to get 1.12.1-rc1 until 1.12.1 is GA?
@somejfn Yes, that's right.
I am definitely running 1.12.1-rc1.
# docker info | grep "Server Version"
Server Version: 1.12.1-rc1
I can confirm I still see the issue. I did another reboot of all 3 docker hosts today, then started the docker daemon on all 3 hosts, and the swarm cluster was back up and running.
# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4oprp4u607bo98mxqlwvus881 azeausdockerapps303t.abc.foo.int Ready Active Reachable
5c8mpecymotn9rdvxb8hkh0tf azeausdockerapps302t.abc.foo.int Ready Active Leader
d7oq3rjt5llc47hr9wt19tood * azeausdockerapps301t.abc.foo.int Ready Active Reachable
I then created my service again with 1 replica, scaled it to 3, and accessed it from my client. 1 host goes to only 1 backend; the other 2 hosts each go to 2 backends.
Host1
# nsenter --net=06bd65103713 cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 00000104 rr
-> 0AFF0007:0000 Masq 1 0 0
Host2/Host3 look like this.
# nsenter --net=9de86f1b7f9a cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 00000104 rr
-> 0AFF0008:0000 Masq 1 0 0
-> 0AFF0006:0000 Masq 1 0 0
When I do a service remove, the entry in /var/run/docker/netns stays there, but of course it has no FWM entries. If I re-create the service, it gets filled with backends again, but again with the wrong (too few) backends, as above.
I have the same issue with 1.12.1-rc1.
Environment:
# docker version
Client:
Version: 1.12.1-rc1
# cat /etc/issue
Ubuntu 16.04.1 LTS \n \l
# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
2mzcl28jislbiafxll8lcwpnb cn07 Ready Active
7mhcf1199f48vm5vxfah7t7w2 * cn06 Ready Active Leader
Steps to reproduce:
docker service create --replicas 1 --publish 8080:80 --name vote instavote/vote
docker service scale vote=10
docker service scale vote=1
At this point, I can access the service through just one host. If I check ipvsadm, I can see the problem.
NODE MASTER (working):
# nsenter --net=241278c9f76b sh
# iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 767 packets, 265K bytes)
pkts bytes target prot opt in out source destination
685 84104 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:9090 MARK set 0x100
242 29765 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 MARK set 0x101
Chain INPUT (policy ACCEPT 61 packets, 3820 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 706 packets, 262K bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 61 packets, 3680 bytes)
pkts bytes target prot opt in out source destination
0 0 MARK all -- * * 0.0.0.0/0 10.255.0.2 MARK set 0x100
0 0 MARK all -- * * 0.0.0.0/0 10.255.0.25 MARK set 0x101
Chain POSTROUTING (policy ACCEPT 767 packets, 265K bytes)
pkts bytes target prot opt in out source destination
# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 256 rr
-> 10.255.0.12:0 Masq 1 0 0
FWM 257 rr
-> 10.255.0.32:0 Masq 1 0 0
ipvsadm is forwarding correctly to IPs 10.255.0.12 and 10.255.0.32.
NODE 2 (not working):
# nsenter --net=3feca1a6e851 sh
# iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 1426 packets, 359K bytes)
pkts bytes target prot opt in out source destination
690 70326 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:9090 MARK set 0x100
505 49345 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 MARK set 0x101
Chain INPUT (policy ACCEPT 152 packets, 10397 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 1274 packets, 349K bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 152 packets, 9277 bytes)
pkts bytes target prot opt in out source destination
0 0 MARK all -- * * 0.0.0.0/0 10.255.0.2 MARK set 0x100
0 0 MARK all -- * * 0.0.0.0/0 10.255.0.25 MARK set 0x101
Chain POSTROUTING (policy ACCEPT 1426 packets, 358K bytes)
pkts bytes target prot opt in out source destination
# ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 256 rr
-> 10.255.0.5:0 Masq 1 0 0
-> 10.255.0.8:0 Masq 1 0 0
-> 10.255.0.9:0 Masq 1 0 0
-> 10.255.0.12:0 Masq 1 0 1
-> 10.255.0.13:0 Masq 1 0 0
-> 10.255.0.16:0 Masq 1 0 0
-> 10.255.0.17:0 Masq 1 0 0
-> 10.255.0.20:0 Masq 1 0 0
-> 10.255.0.21:0 Masq 1 0 0
-> 10.255.0.24:0 Masq 1 0 0
FWM 257 rr
-> 10.255.0.26:0 Masq 1 0 0
-> 10.255.0.29:0 Masq 1 0 0
-> 10.255.0.30:0 Masq 1 0 1
-> 10.255.0.32:0 Masq 1 0 0
-> 10.255.0.33:0 Masq 1 0 0
-> 10.255.0.34:0 Masq 1 0 0
ipvsadm on node 2 keeps forwarding connections to old IPs (.26, .29, .30, etc.). In other words, the scale-down did not update ipvsadm.
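One way to make the mismatch obvious is to compare what swarm thinks is running with what IPVS is actually balancing (a sketch; the namespace ID is the one from node 2 above):

docker service ps vote                       # tasks that should be serving the published port
nsenter --net=3feca1a6e851 ipvsadm -l -n     # real servers IPVS actually has for that fwmark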
@mschirrmeister @asmialoski I've run these scale up/down tests many times and I haven't seen any issues. Can you please post the daemon logs from the nodes where you are having issues?
@mrjana Please see the logs in the attachments.
Let me explain the steps I performed:
I have two nodes: 1 MASTER and 1 WORKER.
docker service create --replicas 1 --publish 9090:80 --name vote2 instavote/vote
docker service create --replicas 1 --publish 8080:80 --name vote instavote/vote
docker service scale vote=3
docker service scale vote2=3
docker service scale vote=1
docker service scale vote2=1
Tks, logs.zip
worker-10-50-1-106 net # journalctl -fu docker
-- Logs begin at Tue 2016-06-28 20:43:18 CST. --
Aug 17 20:02:45 worker-10-50-1-106 bash[23256]: time="2016-08-17T20:02:45+08:00" level=info msg="Firewalld running: false"
Aug 17 20:03:44 worker-10-50-1-106 bash[23256]: time="2016-08-17T20:03:44.904706893+08:00" level=error msg="could not resolve peer \"10.255.0.9\": could not resolve peer: serf instance not initialized"
Aug 18 10:27:38 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:27:38.049230783+08:00" level=error msg="container status unavailable" error="context canceled" module=taskmanager task.id=4p13wqeythnxsyedi9l9xcdpc
Aug 18 10:27:39 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:27:39.162733023+08:00" level=info msg="Failed to delete real server 10.255.0.15 for vip 10.255.0.12 fwmark 276: no such file or directory"
Aug 18 10:27:39 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:27:39+08:00" level=info msg="Firewalld running: false"
Aug 18 10:30:38 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:30:38+08:00" level=info msg="Firewalld running: false"
Aug 18 10:30:39 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:30:39+08:00" level=info msg="Firewalld running: false"
Aug 18 10:30:39 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:30:39+08:00" level=info msg="Firewalld running: false"
Aug 18 10:30:39 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:30:39+08:00" level=info msg="Firewalld running: false"
Aug 18 10:40:33 worker-10-50-1-106 bash[23256]: time="2016-08-18T10:40:33.031375514+08:00" level=error msg="could not resolve peer \"10.255.0.9\": could not resolve peer: serf instance not initialized"
^C
worker-10-50-1-106 net # cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1068.6.0
VERSION_ID=1068.6.0
BUILD_ID=2016-07-12-1826
PRETTY_NAME="CoreOS 1068.6.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
worker-10-50-1-106 net # exit
exit
core@worker-10-50-1-106 ~ $ docker version
Client:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 23:54:00 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 23:54:00 2016
OS/Arch: linux/amd64
core@worker-10-50-1-106 ~ $ cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
core@worker-10-50-1-106 ~ $ docker service ls
ID NAME REPLICAS IMAGE COMMAND
5qld98b1o02z echo 1/1 dhub.yunpro.cn/shenshouer/echo
btgpv1p1tu5k z7a7e4ec7386b738 5/5 dhub.yunpro.cn/shenshouer/echo
core@worker-10-50-1-106 ~ $ docker service ps z7a7e4ec7386b738
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
e6c2bq2zbt7u7mz5po06vw8jg z7a7e4ec7386b738.1 dhub.yunpro.cn/shenshouer/echo worker-10-50-1-107 Running Running 25 minutes ago
0o9ecgb61ej6kw7cx4lrxm0ad z7a7e4ec7386b738.2 dhub.yunpro.cn/shenshouer/echo worker-10-50-1-104 Running Running 25 minutes ago
0tdh8ltcslyhq13yaijvc0y7e z7a7e4ec7386b738.3 dhub.yunpro.cn/shenshouer/echo worker-10-50-1-106 Running Running 25 minutes ago
e0pqh01d6shkgwina0xf39ego z7a7e4ec7386b738.4 dhub.yunpro.cn/shenshouer/echo worker-10-50-1-105 Running Running 25 minutes ago
bcfmpm5xxrn6dw5mh43irf928 z7a7e4ec7386b738.5 dhub.yunpro.cn/shenshouer/echo worker-10-50-1-103 Running Running 23 minutes ago
core@worker-10-50-1-106 ~ $ curl 10.50.1.106:30001
curl: (7) Failed to connect to 10.50.1.106 port 30001: Connection timed out
core@worker-10-50-1-106 ~ $ curl 10.50.1.105:30001
{"clientAddr":"10.255.0.8:35938","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/","RawPath":"","RawQuery":"","Fragment":""}}core@worker-10-50-1-106 ~ $ curl 10.50.1.108:30001
curl: (7) Failed to connect to 10.50.1.108 port 30001: Connection refused
core@worker-10-50-1-106 ~ $ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
46w7emjaro9n5wqsdvlb2axeo worker-10-50-1-102 Ready Active Reachable
5bvwh7kzlxxe2srss0samjsyt * worker-10-50-1-106 Ready Active Reachable
5xig0dyp26ebrxp1g4043fk9g worker-10-50-1-101 Ready Active Leader
6ftk92pv02y85vpfg70yeqan6 worker-10-50-1-107 Ready Active Reachable
7l65yv0zi3qbtojm01jqad6b9 worker-10-50-1-103 Ready Active Reachable
b2v8dld2zuqjgfqx3p1p6oxy1 worker-10-50-1-104 Ready Active Reachable
ezqa3ggqrg5e2w5d3mogmelxo worker-10-50-1-105 Ready Active Reachable
My logs are available here. https://gist.github.com/mschirrmeister/e1b86b93b4524066de7a06aee5bb80ef
What I did was again:
I have similar experiences. When starting with a fresh swarm and a freshly deployed stack (using "docker stack deploy") it works.
I don't do scaling, but I regularly redeploy services (using "docker stack deploy").
After each deploy of the same stack (with updated images) I get more problems accessing the containers. Sometimes I get connection refused, but mostly I get "Connection timed out".
It might be of interest that I regularly restart docker on the node where the deploy command is issued. (Swarm-related commands start to give "Error response from daemon: rpc error: code = 4 desc = context deadline exceeded"; restarting docker is the only way I've found to recover from this.)
Same here with 1.12.1.
I had a 1 manager + 2 workers cluster.
I managed to get an nginx service into a state (by doing nothing special, just scaling a bit and testing how the VIP works) where it would only respond consistently from the node where it was actually running. The rest of the nodes usually did not respond (the connection did not establish), but if I hit Ctrl+C on the curl and ran the command again, it usually responded right away.
I couldn't see anything interesting in the logs. I first restarted docker on all machines, which did not help. Then I rebooted all machines at the same time, and that did solve the issue.
The service was created (and recreated a couple of times) with docker service create. To me it looks like the iptables/VIP layer was somehow not in sync.
After the reboot I haven't been able to recreate the problem.
I'm not exactly sure where to get the VIP's network namespace, but interestingly, no matter which namespace under /var/run/docker/netns/* I enter with nsenter and run ipvsadm in, I never get FWM 257 rr and FWM 256 rr listed together; I only get the last one (256), and only once (just running a single service with 6 replicas on 2 nodes with 1.12.1).
Would that indicate that it's not even trying to forward traffic to the second node?
I got the same issue with 1.12.1. Setup: 3 hosts with 1 manager, exposing port 8090:
docker service create -p 8090:8090 --name monitor 10.21.49.64:5000/monitor
When the service starts, port 8090 is only reachable on one node. If I remove the service and create it again, it is sometimes reachable on two nodes. docker info:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 20
Server Version: 1.12.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 22
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay bridge null host
Swarm: active
NodeID: 9s4xcf48hykhly55y1g0py930
Is Manager: true
ClusterID: 3khrw4i372jm2fcsylpnl3wve
Managers: 1
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.21.49.64
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.19.0-66-generic
Operating System: Ubuntu 14.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.774 GiB
Name: SZX1000108768
ID: U63Q:22L2:Q2WJ:V7CN:ANGP:QT7I:SQOH:LZEW:J35R:DNTQ:U26H:R3SN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
10.21.49.64:5000
127.0.0.0/8
I guessed it might be caused by the OS kernel being too old; after upgrading to 4.2.0-42-generic, the problem disappeared!
I am using 4.4.0-34-generic kernel version and I still have this issue.
I think this issue is critical, and I already see the fix in the reference above. Release 1.12.2? laugh...
I updated my test nodes to 1.12.1 final and can confirm the issue still exists. I can reproduce it.
Experiencing the same +1
@mschirrmeister @bitsofinfo @jmzwcn could you give https://github.com/docker/docker/pull/25962 a try? As per one of the commits (the libnetwork vendoring), it indicates this issue is resolved.
I have almost the same problem, which can be solved by restarting the node. The difference is that I cannot actually access the service through the manager IP address at all (https://github.com/docker/swarmkit/issues/1439); I mean none of the running containers respond.
However, when I go to that node directly and access it by its docker0 IP address, all the containers respond just fine.
In my test the problem is not fully resolved. I tried creating/removing services with the same published port. The published port doesn't get removed cleanly on docker service rm. I think there is a race condition between removing the iptables entries and the next service creation.
Running the following script in a 3-node cluster may result in such an error.
for i in `seq 1 4`
do
docker service create --name ftest -p 8021:80 dongluochen/nctest
sleep 20
curl 127.0.0.1:8021
docker service rm ftest
docker service create --name gtest -p 8021:80 dongluochen/nctest
sleep 20
curl 127.0.0.1:8021
docker service rm gtest
done
In a node's ingress sbox, I can find multiple entries for dpt:8021 when it fails.
root@ip-172-19-241-144:/# iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 22 packets, 1374 bytes)
pkts bytes target prot opt in out source destination
23 1426 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8021 MARK set 0x16e
22 1366 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8021 MARK set 0x18c
16 960 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8021 MARK set 0x1a2
10 692 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8022 MARK set 0x1a3
root@ip-172-19-241-144:/# ipvsadm -l -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 418 rr
-> 10.255.0.5:0 Masq 1 0 0
FWM 419 rr
-> 10.255.0.7:0 Masq 1 0 0
ubuntu@ip-172-19-241-144:~$ docker version
Client:
Version: 1.13.0-dev
API version: 1.25
Go version: go1.7
Git commit: bf0df06
Built: Fri Aug 26 00:14:08 2016
OS/Arch: linux/amd64
Server:
Version: 1.13.0-dev
API version: 1.25
Go version: go1.7
Git commit: bf0df06
Built: Fri Aug 26 00:14:08 2016
OS/Arch: linux/amd64
If I add sleep between service remove and next creation, I do not see the failure.
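(For reference, a sketch of that variant: the same script as above, with an extra pause after the remove; the 10-second value is arbitrary.)

for i in `seq 1 4`
do
  docker service create --name ftest -p 8021:80 dongluochen/nctest
  sleep 20
  curl 127.0.0.1:8021
  docker service rm ftest
  sleep 10   # give iptables/IPVS cleanup time to finish before the next create
done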
@mavenugo Is there already a prebuilt binary for #25962? I have no time this week to build something from source.
I met the same problem. I recreated the service, but the dead entries still remain in the IPVS table, even after I forced the node to leave the swarm or restarted the docker daemon. Maybe it's ok to remove them one by one using ipvsadm, but that's not feasible in prod.
I'm not familiar with IPVS; I wonder whether expire_nodest_conn should be set to 1 to drop dead entries automatically?
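For reference, a sketch of both ideas inside the load-balancing namespace (the namespace name, fwmark 257, and backend 10.255.0.26 are illustrative values taken from outputs earlier in this thread; whether expire_nodest_conn actually helps here is exactly the open question):

nsenter --net=/var/run/docker/netns/<ingress-ns> sh -c '
  sysctl -w net.ipv4.vs.expire_nodest_conn=1   # expire connections whose real server is gone
  ipvsadm -l -n                                # list fwmark services and their real servers
  ipvsadm -d -f 257 -r 10.255.0.26:0           # manually remove one stale real server
'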
I am trying to do a long-running test (in a 3-node cluster) with the fixes in #25962, which scales a service up and down, to see if there are any issues. After that I can post an experimental binary to whoever wants it.
@mrjana I can still reproduce the problem with service create/rm. If I run the following script several times, I can get a failure.
PORT=8088
SERVICE=test
for i in `seq 1 4`
do
docker service create --name $SERVICE -p $PORT:80 dongluochen/nctest
sleep 15
echo "deleting service"
docker service rm $SERVICE
done
docker service create --name $SERVICE -p $PORT:80 dongluochen/nctest
Service availability is validated with the following script on any node in the cluster.
while true; do curl -s --show-error -I http://127.0.0.1:8088 | head -n 1; sleep 0.1; done
I'm running mrjana/docker@3ff7123.
ubuntu@ip-172-19-241-144:~$ docker version
Client:
Version: 1.13.0-dev
API version: 1.25
Go version: go1.7
Git commit: 3ff7123
Built: Wed Aug 31 00:33:18 2016
OS/Arch: linux/amd64
Server:
Version: 1.13.0-dev
API version: 1.25
Go version: go1.7
Git commit: 3ff7123
Built: Wed Aug 31 00:33:18 2016
OS/Arch: linux/amd64
@dongluochen I don't think this is the right validation. If you are creating and removing a service and also periodically running curl against said service in parallel, the curl is bound to fail sometimes, because the service is not available during the fraction of time when it is removed and being re-added. What you did previously is the right way to validate service create/rm.
@mrjana I'm validating after the last service create, not in between. I reran my test from a fresh cluster (starting docker after removing /var/run/docker and /var/lib/docker on all the nodes) with the following script. After the test run the service is up, but the load balancer is not passing traffic properly.
PORT=8089
SERVICE=test
for i in `seq 1 40`
do
docker service create --name $SERVICE -p $PORT:80 dongluochen/nctest
docker service rm $SERVICE
done
docker service create --name $SERVICE -p $PORT:80 dongluochen/nctest
Here is the mangle table from the ingress sandbox on the node with the container running. Traffic to port 8089 is marked as 261 (0x105), but there is no such mark in ipvsadm.
root@ip-172-19-241-144:/# iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 1837 packets, 110K bytes)
pkts bytes target prot opt in out source destination
4168 272K MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8089 MARK set 0x105
Chain INPUT (policy ACCEPT 1837 packets, 110K bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 1837 packets, 73480 bytes)
pkts bytes target prot opt in out source destination
0 0 MARK all -- * * 0.0.0.0/0 10.255.0.7 MARK set 0x106
Chain POSTROUTING (policy ACCEPT 1837 packets, 73480 bytes)
pkts bytes target prot opt in out source destination
root@ip-172-19-241-144:/# ipvsadm -l -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
FWM 262 rr
-> 10.255.0.8:0 Masq 1 0 0
@dongluochen Thanks for the clarification. In your previous script you had a 15-second sleep between the create and remove; in the new one you don't. Is that important for reproducibility? I will try your latest script to see if I can repro.
I got a similar problem with the previous script. The 15 seconds was just to validate that each service create was successful. The latest script makes the problem faster to reproduce.
@mavenugo @mrjana I cloned the docker master repo and applied the patches/commits from #25962 (hope I did everything right). The build produced docker 1.13.0-dev.
With that version, my problem still exists. Reproduced with:
I then ran curl against all 3 docker hosts. 2 nodes balanced between 2 containers, and the 3rd node always went to 1 container.
docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 4
Server Version: 1.13.0-dev
Storage Driver: btrfs
Build Version: Btrfs v3.17
Library Version: 101
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host null overlay overlay
Swarm: active
NodeID: d7oq3rjt5llc47hr9wt19tood
Is Manager: true
ClusterID: 51zzdq5p2xe8otuwmbalyfy2t
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.218.3.5
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.805 GiB
Name: azeausdockerapps301t.azr.omg.wpp
ID: LWMY:RHUH:JJ5O:OP6G:5LV5:7P7B:WI3W:2JMI:B7HY:EP6J:A7SW:DUX2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
docker version
# docker version
Client:
Version: 1.13.0-dev
API version: 1.25
Go version: go1.7
Git commit: 800d5f8
Built: Mon Sep 5 13:17:20 2016
OS/Arch: linux/amd64
Server:
Version: 1.13.0-dev
API version: 1.25
Go version: go1.7
Git commit: 800d5f8
Built: Mon Sep 5 13:17:20 2016
OS/Arch: linux/amd64
@mschirrmeister You seem to be having a basic issue with load balancing. There seems to be something unique in your environment; I would have to take a look at your hosts to see what's different. I know you offered to provide access to your machines. Is that still possible?
I have the same problem on 6 Raspberry Pi nodes on 1.12.1...
root@swarm00:/var/run/docker/netns# docker info
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 3
Server Version: 1.12.1
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host null overlay
Swarm: active
NodeID: 0p0khtm1r1o3qk6wn11ida254
Is Manager: true
ClusterID: f2jppvtyvx5nz5r46ljejh8cx
Managers: 1
Nodes: 6
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 192.168.144.80
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.11-v7+
Operating System: Raspbian GNU/Linux 8 (jessie)
OSType: linux
Architecture: armv7l
CPUs: 4
Total Memory: 925.5 MiB
Name: swarm00.dev
ID: K7WM:MPI7:FHJT:PDYF:BDRK:77MC:AJKA:DMTB:YH22:T6OO:7GQR:OXUH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpuset support
Insecure Registries:
127.0.0.0/8
root@swarm00:/var/run/docker/netns# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
0p0khtm1r1o3qk6wn11ida254 * swarm00.dev Ready Active Leader
17xld3ibyq36eo1g76gdbfw0v swarm02.dev Ready Active
6losgi6e2m62dam1ms5trn0tw swarm04.dev Ready Active
7bxv7npaiikjt4q3vsle09kmq swarm03.dev Ready Active
d8itn0shhggeys0dah0hubnjg swarm05.dev Ready Active
eg59x4humv78j3mnqm9aaa64z swarm01.dev Ready Active
This looks like it's trying to do round-robin but it can't find its way to the other nodes...
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
{"guid":"f432266e-23b7-46b3-bc18-29f7bd2deaf3","container":"3f6e68a79782"}root@swarm00:/var/run/docker/netns#
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
curl: (7) Failed to connect to 172.17.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.17.0.1:30000/guid
{"guid":"7c64cb9f-88b1-4a1c-ae75-392bfb1593a2","container":"3f6e68a79782"}root@swarm00:/var/run/docker/netns#
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
{"guid":"55c10ab5-bbc7-491c-aaed-def520e8a2c2","container":"3f6e68a79782"}root@swarm00:/var/run/docker/netns#
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
curl: (7) Failed to connect to 172.18.0.1 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 172.18.0.1:30000/guid
{"guid":"1f06d5ee-b892-4771-abd8-4ba4d5808850","container":"3f6e68a79782"}root@swarm00:/var/run/docker/netns#
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
{"guid":"e3179120-7bc7-4d33-a69c-cdb225cbe24d","container":"3f6e68a79782"}root@swarm00:/var/run/docker/netns#
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
curl: (7) Failed to connect to 192.168.144.80 port 30000: No route to host
root@swarm00:/var/run/docker/netns# curl 192.168.144.80:30000/guid
{"guid":"413fad44-b749-4084-8d12-383846b46ad7","container":"3f6e68a79782"}root@swarm00:/var/run/docker/netns#
Rebooting all nodes didn't help...
root@swarm00:/var/run/docker/netns# docker service ps service1
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
6ghzcz968ii8xn6sh5sccxdnv service1.1 alexellis2/guid-generator-arm:0.1 swarm03.dev Running Running 15 minutes ago
cpwn5oqdaii4kejakossruc05 \_ service1.1 alexellis2/guid-generator-arm:0.1 swarm03.dev Shutdown Complete 15 minutes ago
exz9hb8svag53la929ntbnsd5 service1.2 alexellis2/guid-generator-arm:0.1 swarm00.dev Running Running 15 minutes ago
146ik9at2wplzdjmbnilfbzqc \_ service1.2 alexellis2/guid-generator-arm:0.1 swarm00.dev Shutdown Complete 16 minutes ago
dww7em3z7whkou2y2i7bnhzjl service1.3 alexellis2/guid-generator-arm:0.1 swarm01.dev Running Running 15 minutes ago
ehvwqets4xuwwwyf04h16zxsa \_ service1.3 alexellis2/guid-generator-arm:0.1 swarm01.dev Shutdown Complete 16 minutes ago
8m998xf3zp09xkdord6x7drz8 service1.4 alexellis2/guid-generator-arm:0.1 swarm03.dev Running Running 15 minutes ago
amexws0gndc0cy1ovsljvfyih \_ service1.4 alexellis2/guid-generator-arm:0.1 swarm05.dev Shutdown Complete 15 minutes ago
ejp5ewej89udslvv5hpaqu7pv service1.5 alexellis2/guid-generator-arm:0.1 swarm02.dev Running Running 15 minutes ago
e4y7xfhi5i9sr9geem8r9ctbu \_ service1.5 alexellis2/guid-generator-arm:0.1 swarm04.dev Shutdown Complete 15 minutes ago
4rqkq8jocg5283hljqq81jxqr service1.6 alexellis2/guid-generator-arm:0.1 swarm02.dev Running Running 15 minutes ago
7vlmvgo4kkd9doqrbya7pp9tc \_ service1.6 alexellis2/guid-generator-arm:0.1 swarm02.dev Shutdown Complete 15 minutes ago
To me this looks like no network setup is being done...
root@swarm00:/var/run/docker/netns# docker network ls
NETWORK ID NAME DRIVER SCOPE
fc3149d14599 bridge bridge local
d09f7618ef12 docker_gwbridge bridge local
433c7dd36190 host host local
9sg8v29i00yi ingress overlay swarm
55e822266712 none null local
root@swarm00:/var/run/docker/netns# ls -l
total 0
-r--r--r-- 1 root root 0 Sep 8 09:32 09e75a3493ee
-r--r--r-- 1 root root 0 Sep 8 09:32 1-9sg8v29i00
-r--r--r-- 1 root root 0 Sep 8 09:32 11f554d1254e
root@swarm00:/var/run/docker/netns# nsenter --net=1-9sg8v29i00 cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
root@swarm00:/var/run/docker/netns# nsenter --net=1-9sg8v29i00 iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 72 packets, 7056 bytes)
pkts bytes target prot opt in out source destination
Chain INPUT (policy ACCEPT 2 packets, 736 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 71 packets, 6688 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 71 packets, 6688 bytes)
pkts bytes target prot opt in out source destination
Please let me know if I can provide more information.
As an aside, why the difference between 9sg8v29i00yi and 1-9sg8v29i00?
I would like to add that I created the service with 6 replicas from the outset and the networking never worked. This isn't a case of scaling up or down after the service launch.
@darkermatter You most probably have a different problem. Since you are on an RPi, and if you are on the Raspbian distro, I would check whether you have the vxlan module in your kernel by running lsmod. If you don't have it, that is likely your problem; on Raspbian you can get it by running rpi-update, I believe.
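A quick sketch of that check (the modprobe and rpi-update steps are only suggestions; rpi-update replaces kernel/firmware, so use it with care):

lsmod | grep vxlan      # is the vxlan module loaded?
sudo modprobe vxlan     # try to load it if it exists but isn't loaded
# if modprobe reports the module is missing, a kernel/firmware update may provide it:
# sudo rpi-update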
How is this closed? The OP @mschirrmeister never confirmed it is fixed.
@mrjana is this fixed by https://github.com/docker/docker/commit/99c39680984018b345881a29d77a89f87958a57b?
I will reopen the issue. This issue has become a kitchen-sink issue for various different problems. For example, the issue reported by @asmialoski in this thread is definitely fixed by https://github.com/docker/docker/commit/99c39680984018b345881a29d77a89f87958a57b, so I mentioned this issue in the commit log of my PR, which automatically closed this issue. But the original issue as reported by @mschirrmeister is probably not resolved yet. We can keep it open until that is resolved.
@asmialoski If you want, you can use a docker/docker master build to verify whether your issue is resolved now.
Hi,
I have a problem with the Docker 1.12 swarm mode load balancing. The setup has 3 hosts with Docker 1.12 on CentOS 7 running in Azure. Nothing really special about the hosts: a plain CentOS 7 setup, Docker 1.12 from the Docker yum repo, and btrfs as a data disk for /var/lib/docker.
If I create 2 services, scale them to 3, and then try to access them from a client, the access occasionally does not work. That means if you access the services via the docker host IP address(es) and exposed ports, some containers do not respond.
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.): The current test environment is running on Microsoft Azure.
Steps to reproduce the issue (a sketch of the corresponding commands follows after this list):
1. Create the overlay network
2. Create the services and scale them
3. docker service ls
4. docker service ps service1
5. docker service ps service2
6. docker service inspect service1
7. Access service1 from a client against docker host 1
8. Access service2 from a client against docker host 1
9. Access service1 from a client against docker host 2
10. Access service2 from a client against docker host 2
11. Access service1 from a client against docker host 3
12. Access service2 from a client against docker host 3
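A minimal sketch of steps 1 and 2 (the network name, image, and published ports are placeholders; the exact commands from the original report are not included here):

docker network create --driver overlay app-net
docker service create --name service1 --network app-net -p 8081:80 jwilder/whoami
docker service create --name service2 --network app-net -p 8082:80 jwilder/whoami
docker service scale service1=3
docker service scale service2=3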
Describe the results you received: Not all containers respond when accessing the services via the docker host IP addresses and exposed ports.
Describe the results you expected: All containers of a service should respond, no matter via which docker host the service is accessed.
Additional information you deem important (e.g. issue happens only occasionally): The issue is occasional, in the sense that if you delete and re-create the service, sometimes all containers respond, and sometimes containers on a different host do not respond.
It is at least consistent once a service is created. Let's say containers on host 2 and host 3 do not respond when accessed via docker host 1; then it stays like this for the lifetime of that service.