saithala opened this issue 8 years ago
I'm hitting the same problem. Can someone explain why a service created via swarm cannot be accessed from a node that isn't running any of its containers?
My suspicion is that the ingress network is not synced to a node when no containers are running on it.
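A quick way to check whether the ingress network is even present on a given node (just a check I'm using, not a confirmed diagnosis) is:
docker network ls --filter name=ingress
docker network inspect ingress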
@saithala Could you please retry on 1.12.2? Thanks
@aluzzardi no luck on 1.12.2 either.
@saithala can you verify that the required ports for overlay networking are open and accessible between nodes? https://docs.docker.com/engine/swarm/swarm-tutorial/#/open-ports-between-the-hosts
Also may be worth checking if a dependency / configuration is missing on your nodes using the check-config script https://github.com/docker/docker/blob/master/contrib/check-config.sh
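For reference, the ports that need to be reachable between the hosts are 2377/tcp (cluster management), 7946/tcp and 7946/udp (node-to-node communication), and 4789/udp (overlay network traffic). On an Ubuntu host using ufw, opening them would look roughly like this (adjust for iptables or AWS security groups as appropriate):
sudo ufw allow 2377/tcp
sudo ufw allow 7946/tcp
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp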
I'm starting to run into this. Except in our case, this used to work up until this morning. Suddenly, some of the ports on other nodes are not accessible. Running docker 1.12.2. I've verified that the necessary ports are open and accessible between nodes.
Some of our services are still accessible from other nodes while others aren't
docker info
Containers: 15
Running: 5
Paused: 0
Stopped: 10
Images: 28
Server Version: 1.12.2
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 192
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host bridge overlay null
Swarm: active
NodeID: 6t1dwdfrcosk816bf5oxxkzei
Is Manager: true
ClusterID: 2989ooasbl08rjlotjxx2yc6x
Managers: 3
Nodes: 3
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.10.14.39
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-36-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.42 GiB
Name: ip-10-10-14-39
ID: H4KX:W7CG:C5P7:ABZY:QM3I:QHRE:KX3R:VPUE:AOMI:I65K:ID27:HEPN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
@nsamala can you check the daemon logs on each node, to see if something stands out?
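On Ubuntu 16.04 with systemd, something along these lines should surface overlay/memberlist errors (assuming the unit is named docker.service):
sudo journalctl -u docker.service --since "1 hour ago" --no-pager | grep -iE "memberlist|overlay|ingress"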
Sorry, I should've updated this! I found that we were using Kong and I had unnecessarily published port 7946. This is what caused our issue as far as I can tell.
Only discovered this after starting a new swarm and turning services on one by one. Haven't had the problem since.
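In case anyone else hits a similar conflict: dumping the published ports of every service makes a collision with 7946 easy to spot, e.g. (just one way to do it, using jq as above):
docker service inspect $(docker service ls -q) | jq '.[] | {name: .Spec.Name, ports: .Endpoint.Ports}'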
@nsamala no problem, that's good news :smile:
I'm having this exact same issue. I posted on StackOverflow but I'll copy it here:
I am running a three-node Swarm Mode cluster in AWS; one master and two workers. This is swarm mode, not to be confused with docker swarm from pre-1.12. Using the master node I created a service with the following:
docker service create --replicas 1 --name myapp -p 3000 myapp
When I run docker service ps myapp, I get the following output:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR
02awst8p9pezgpkfzqgz8z79t myapp.1 myapp:latest swarm-node-01 Running Running 19 minutes ago
The running task is deployed to swarm-node-01.
I checked the auto-selected port which was published publicly
$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000
According to the documentation:
External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.
But when I try to curl the nodes that do not have the task running, I get connection refused.
$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}
$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused
_note: I scrubbed the IP of node-02_
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 5
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 28
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay host bridge null
Swarm: active
NodeID: 8hhpc2i8pxpmrm6feoyegjad8
Is Manager: true
ClusterID: 4tae6frmxphphh6mmt1f3xg9h
Managers: 1
Nodes: 3
Orchestration:
Task History Retention Limit: 3
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.0.0.12
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.2.0-18-generic
Operating System: Ubuntu 15.10
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 991.2 MiB
Name: swarm-master-01
ID: UGF3:ZDZY:QK3Q:AXIK:5LF7:K7X6:22SF:ITPU:LV3D:HFLO:ZHEF:3ERW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
provider=amazonec2
Insecure Registries:
127.0.0.0/8
@blockloop did you check my previous comment https://github.com/docker/docker/issues/27237#issuecomment-254006465
I just found this error in the node's syslog:
Nov 9 15:37:54 ubuntu docker[23092]: time="2016-11-09T15:37:54.494743354Z" level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
So, yes, looks like the required ports for overlay networking are not opened between your nodes
The security group in AWS is set to allow all traffic from within the same SG.
Running this on the master shows that nothing is listening on that port:
ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused
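Another way to confirm nothing is bound to 7946 at all (assuming ss from iproute2 is available on the host):
sudo ss -tulnp | grep 7946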
Found this in the syslog of the master, but I don't remember explicitly trying to remove the ingress network. Perhaps this is an internal process:
Nov 8 21:44:03 ubuntu docker[24901]: time="2016-11-08T21:44:03.724145514Z" level=error msg="remove task failed" error="network ingress not found" module=taskmanager task.id=1w7yas48o7zq64umfersauxo8
I believe the culprit is the master, which does not appear to be listening on port 7946. Network tools show that 7946 is listening on the worker nodes, but not on the master.
$ docker-machine ssh swarm-node-01 nc -zv 0.0.0.0 7946
Connection to 0.0.0.0 7946 port [tcp/*] succeeded!
$ docker-machine ssh swarm-node-02 nc -zv 0.0.0.0 7946
Connection to 0.0.0.0 7946 port [tcp/*] succeeded!
$ docker-machine ssh swarm-master-01 nc -zv 0.0.0.0 7946
nc: connect to 0.0.0.0 port 7946 (tcp) failed: Connection refused
exit status 1
When I check the syslogs for the nodes I see the following error
level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
When you initialized the swarm, did you provide an --advertise-addr (and possibly a --listen-addr)?
I provided both as 10.0.0.12:2377
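i.e. an init along these lines (reconstructed for illustration, not the exact command that was run):
docker swarm init --advertise-addr 10.0.0.12:2377 --listen-addr 10.0.0.12:2377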
@mavenugo any thoughts?
I was able to get around the issue for now, but I'd like to know how to figure out why the overlay network port wasn't listening on master-01. Here is what I did to circumvent the issue:
Now all of the machines are working as expected except for master-01. One task is running on node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster and joining it again as a master. Now 7946 is listening on that machine.
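For anyone needing to do the same, the remove-and-rejoin sequence looks roughly like this (a sketch only, assuming at least one other reachable manager; substitute your own node name and manager address):
# on a healthy manager: demote and remove the broken node
docker node demote swarm-master-01
docker node rm --force swarm-master-01
# on the broken node: discard its old swarm state
docker swarm leave --force
# on a healthy manager: print the manager join token
docker swarm join-token manager
# back on the broken node: rejoin as a manager with that token
docker swarm join --token <manager-token> <manager-ip>:2377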
All nodes are running 1.12.3?
Yes
Description
A service running on one node with a published port is not accessible when accessed on the same port from another node. I have 4 nodes in the cluster: 3 managers and 1 worker.
Steps to reproduce the issue:
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):