Docker Engine Swarm Mode + Load Balancing

saithala commented 7 years ago

Description

A service running on one node with a published port is not accessible on the same port through another node. I have 4 nodes in the cluster: 3 managers and 1 worker.

Steps to reproduce the issue:

Deploy a service (web app) with a port (8080) published to the swarm.
The scheduler schedules it in node 1. The service (web app) is accessible from a browser at node1:8080
When I try to access the service at node2:8080, the app does not come up.
As per the docs, Docker Swarm automatically makes a published port available on ALL nodes, whether or not the service is running locally on that node. Is this not true? If it is, any idea why it's not working for me?
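The check described in these steps can be scripted so that every node is probed the same way. The following is only a minimal sketch: the function name `probe_published_port` and the host names in the usage comment are hypothetical, and it assumes `curl` is installed on the machine running the check.

```shell
#!/usr/bin/env bash
# Sketch: probe one published port on every node of a swarm.
# Host names and the port are placeholders, not values taken from the report.

# probe_published_port PORT NODE...
# Prints "<node>:<port> reachable" or "<node>:<port> unreachable" for each node
# and returns non-zero if at least one node did not answer.
probe_published_port() {
  local port=$1 node status=0
  shift
  for node in "$@"; do
    # --max-time bounds the wait; any HTTP response counts as reachable.
    if curl --silent --output /dev/null --max-time 5 "http://${node}:${port}/"; then
      echo "${node}:${port} reachable"
    else
      echo "${node}:${port} unreachable"
      status=1
    fi
  done
  return "$status"
}

# Example (hypothetical host names):
#   probe_published_port 8080 node1 node2 node3 node4
```

If the routing mesh is healthy, every node should report reachable, including nodes that run no task for the service.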

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 209
Server Version: 1.12.1
Storage Driver: devicemapper
 Pool Name: docker-253:0-204218855-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 5.067 GB
 Data Space Total: 107.4 GB
 Data Space Available: 31.06 GB
 Metadata Space Used: 8.471 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.139 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: active
 NodeID: 3y23oi8zayd4r5ehs8sq2438q
 Is Manager: true
 ClusterID: eytqq32a3nwwkal9iv6i936jd
 Managers: 3
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.1.2.56
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 47.01 GiB
Name: cmcmdmz101
ID: N6XG:EQOG:FPUM:GYCY:HXHQ:DOVU:AHFM:TFBB:4GM6:WKC3:QCN2:6RLP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: cmcdocker
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.):

LamCiuLoeng commented 7 years ago

I have the same problem. Could someone explain why a service created via swarm cannot be accessed from a node that is not running one of its containers?

cmingxu commented 7 years ago

The suspected reason is that the ingress network is not synced to a node when no containers are running on it.

aluzzardi commented 7 years ago

@saithala Could you please retry on 1.12.2? Thanks

saithala commented 7 years ago

@aluzzardi no luck on 1.12.2 either.

thaJeztah commented 7 years ago

@saithala can you verify if the required ports for overlay networking are opened, and accessible between nodes? https://docs.docker.com/engine/swarm/swarm-tutorial/#/open-ports-between-the-hosts

Also may be worth checking if a dependency / configuration is missing on your nodes using the check-config script https://github.com/docker/docker/blob/master/contrib/check-config.sh
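For reference, swarm mode uses TCP 2377 for cluster management, TCP and UDP 7946 for communication among nodes, and UDP 4789 for overlay network traffic. The sketch below probes only the two TCP ports from one node towards another. The function names are hypothetical, it assumes bash (for `/dev/tcp`) and the coreutils `timeout` command, and it says nothing about the UDP ports, which a plain connect test cannot verify.

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability check for the swarm control ports.
# UDP 7946 and UDP 4789 are NOT covered by this check.

# probe_tcp HOST PORT: prints "open" if a TCP connection succeeds within
# 3 seconds, otherwise "closed".
probe_tcp() {
  local host=$1 port=$2
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo open
  else
    echo closed
  fi
}

# check_swarm_tcp_ports HOST: probes TCP 2377 and TCP 7946 on HOST.
check_swarm_tcp_ports() {
  local host=$1 port
  for port in 2377 7946; do
    printf '%s:%s/tcp %s\n' "$host" "$port" "$(probe_tcp "$host" "$port")"
  done
}

# Example (run on each node against every other node's address):
#   check_swarm_tcp_ports 10.1.2.56
```

Note that TCP 2377 is expected to be open only on manager nodes, so a "closed" result for that port on a worker is not an error by itself.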

outlandnish commented 7 years ago

I'm starting to run into this. Except in our case, this used to work up until this morning. Suddenly, some of the ports on other nodes are not accessible. Running docker 1.12.2. I've verified that the necessary ports are open and accessible between nodes.

Some of our services are still accessible from other nodes while others aren't.

docker info

Containers: 15
 Running: 5
 Paused: 0
 Stopped: 10
Images: 28
Server Version: 1.12.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 192
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: active
 NodeID: 6t1dwdfrcosk816bf5oxxkzei
 Is Manager: true
 ClusterID: 2989ooasbl08rjlotjxx2yc6x
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.10.14.39
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-36-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.42 GiB
Name: ip-10-10-14-39
ID: H4KX:W7CG:C5P7:ABZY:QM3I:QHRE:KX3R:VPUE:AOMI:I65K:ID27:HEPN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

thaJeztah commented 7 years ago

@nsamala can you check daemon logs of each node, to see if something stands out?

outlandnish commented 7 years ago

Sorry, I should've updated this! I found that we were using Kong and I had unnecessarily published port 7946. This is what caused our issue as far as I can tell.

Only discovered this after starting a new swarm and turning services on one by one. Haven't had the problem since.
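A conflict like this one can be caught by comparing the ports that services publish against the ports swarm mode needs for itself (2377, 7946 and 4789). This is a hedged sketch, not something from the thread: the function names are hypothetical, and the `docker` and `jq` pipeline in the usage comment is one assumed way to produce the "service port" pairs it reads.

```shell
#!/usr/bin/env bash
# Sketch: flag services that publish a port swarm mode itself relies on.

# is_swarm_reserved_port PORT: succeeds if PORT is 2377, 7946 or 4789.
is_swarm_reserved_port() {
  case $1 in
    2377|7946|4789) return 0 ;;
    *) return 1 ;;
  esac
}

# flag_reserved_ports: reads "service port" pairs on stdin and prints a
# warning for every pair whose port is reserved by swarm mode.
flag_reserved_ports() {
  local service port
  while read -r service port; do
    if is_swarm_reserved_port "$port"; then
      echo "WARNING: ${service} publishes swarm-reserved port ${port}"
    fi
  done
}

# Example (assumes jq is installed; run on a manager node):
#   docker service inspect $(docker service ls -q) \
#     | jq -r '.[] | .Spec.Name as $n | .Endpoint.Ports[]? | "\($n) \(.PublishedPort)"' \
#     | flag_reserved_ports
```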

thaJeztah commented 7 years ago

@nsamala no problem, that's good news :smile:

blockloop commented 7 years ago

I'm having this exact same issue. I posted on StackOverflow but I'll copy it here:

I am running a three-node Swarm Mode cluster in AWS: one master and two workers. This is swarm mode, not to be confused with the standalone Docker Swarm from before 1.12. Using the master node I created a service with the following:

docker service create --replicas 1 --name myapp -p 3000 myapp

When I run docker service ps myapp I get the following output

ID                         NAME     IMAGE         NODE             DESIRED STATE  CURRENT STATE            ERROR
02awst8p9pezgpkfzqgz8z79t  myapp.1  myapp:latest  swarm-node-01    Running        Running 19 minutes ago

The running task is deployed to swarm-node-01.

I checked the auto-selected port which was published publicly

$ docker service inspect myapp | jq .[].Endpoint.Ports[].PublishedPort
30000

According to the documentation:

External components, such as cloud load balancers, can access the service on the PublishedPort of any node in the cluster whether or not the node is currently running the task for the service. All nodes in the swarm route ingress connections to a running task instance.

But when I try to curl the nodes that do not have the task running, I get connection refused.

$ curl $(docker-machine ip swarm-node-01):30000/stats
{"uptime":"2016-11-09T14:48:35Z","requestCount":7,"statuses":{"200":7},"pid":1,"open_db_conns":0}

$ curl $(docker-machine ip swarm-node-02):30000/stats
curl: (7) Failed to connect to [the IP] port 30000: Connection refused

_note: I scrubbed the IP of node-02_

Docker Info

Containers: 2
 Running: 1
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 28
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay host bridge null
Swarm: active
 NodeID: 8hhpc2i8pxpmrm6feoyegjad8
 Is Manager: true
 ClusterID: 4tae6frmxphphh6mmt1f3xg9h
 Managers: 1
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 3
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.0.12
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.2.0-18-generic
Operating System: Ubuntu 15.10
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 991.2 MiB
Name: swarm-master-01
ID: UGF3:ZDZY:QK3Q:AXIK:5LF7:K7X6:22SF:ITPU:LV3D:HFLO:ZHEF:3ERW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=amazonec2
Insecure Registries:
 127.0.0.0/8

Troubleshooting

The nodes are both properly connected to the swarm.
Scaling the service up to 5 (which puts a task on every node) makes curl work on every node.
Scaling the service down to 0 and then back up causes the task to switch to another node, which makes curl work on the new node but stop working on the previous node.
Removing and recreating the service changes nothing.

thaJeztah commented 7 years ago

@blockloop did you check my previous comment https://github.com/docker/docker/issues/27237#issuecomment-254006465

blockloop commented 7 years ago

I just found this error in the nodes' syslog:

Nov  9 15:37:54 ubuntu docker[23092]: time="2016-11-09T15:37:54.494743354Z" level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"
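Log lines of this shape name the peer address and port that could not be dialed, so they can be filtered to see which node is refusing gossip connections. The sketch below is a hypothetical helper (`failed_memberlist_peers`) that relies only on the message format shown above; the syslog path in the usage comment is an assumption and differs between distributions.

```shell
#!/usr/bin/env bash
# Sketch: list the peers a daemon failed to dial while joining the memberlist.

# failed_memberlist_peers: reads daemon log lines on stdin and prints each
# distinct "ip:port" that appears after "dial tcp" in a
# "Failed to join memberlist" message.
failed_memberlist_peers() {
  grep 'Failed to join memberlist' \
    | grep -oE 'dial tcp [0-9.]+:[0-9]+' \
    | awk '{print $3}' \
    | sort -u
}

# Example (log location is an assumption):
#   grep docker /var/log/syslog | failed_memberlist_peers
```

For the line quoted above this prints `10.0.0.12:7946`, which points at the node that is not accepting connections on the gossip port.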

thaJeztah commented 7 years ago

So, yes, looks like the required ports for overlay networking are not opened between your nodes

blockloop commented 7 years ago

The security group in AWS is set to allow all traffic from within the same SG.

blockloop commented 7 years ago

Running this on the master shows that the port isn't listening:

ubuntu@swarm-master-01:~$ sudo lsof -i :7946
ubuntu@swarm-master-01:~$ cat < /dev/tcp/10.0.0.12/7946
-bash: connect: Connection refused
-bash: /dev/tcp/10.0.0.12/7946: Connection refused
ubuntu@swarm-master-01:~$ cat < /dev/tcp/0.0.0.0/7946
-bash: connect: Connection refused
-bash: /dev/tcp/0.0.0.0/7946: Connection refused

I found this in the syslog of the master, but I don't remember trying to explicitly remove the ingress network. Perhaps this is an internal process.

Nov  8 21:44:03 ubuntu docker[24901]: time="2016-11-08T21:44:03.724145514Z" level=error msg="remove task failed" error="network ingress not found" module=taskmanager task.id=1w7yas48o7zq64umfersauxo8

blockloop commented 7 years ago

I believe the culprit is the master, which does not appear to be listening on port 7946. Network tools show that 7946 is listening on the nodes, but not on the master.

$ docker-machine ssh swarm-node-01 nc -zv 0.0.0.0 7946
Connection to 0.0.0.0 7946 port [tcp/*] succeeded!

$ docker-machine ssh swarm-node-02 nc -zv 0.0.0.0 7946
Connection to 0.0.0.0 7946 port [tcp/*] succeeded!

$ docker-machine ssh swarm-master-01 nc -zv 0.0.0.0 7946
nc: connect to 0.0.0.0 port 7946 (tcp) failed: Connection refused
exit status 1

When I check the syslogs for the nodes I see the following error

level=error msg="Failed to join memberlist [10.0.0.12] on retry: 1 error(s) occurred:\n\n* Failed to join 10.0.0.12: dial tcp 10.0.0.12:7946: getsockopt: connection refused"

thaJeztah commented 7 years ago

When you initialized the swarm, did you provide a --advertise-addr (and possibly --listen-addr)?

blockloop commented 7 years ago

I provided both as 10.0.0.12:2377

thaJeztah commented 7 years ago

@mavenugo any thoughts?

blockloop commented 7 years ago

I was able to get around the issue for now, but I'd like to know how to figure out why the overlay network port wasn't listening on master-01. Here is what I did to circumvent the issue:

created another node with docker-machine called swarm-master-02
joined swarm-master-02 to the cluster as a master
demoted master-01 which set master-02 as the leader
restarted the docker daemon on each node (might not have been necessary)

Now all of the machines are working as expected except for master-01. One task is running on node-01 and curl works against all nodes by forwarding the traffic to the proper container on the proper node. However, swarm-master-01 refuses to listen on the overlay network and curl does not work against this node. I was only able to fix swarm-master-01 by completely removing it from the cluster and joining it again as a master. Now 7946 is listening on that machine.
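The exact commands used for the final remove-and-rejoin step are not given in the thread. The sketch below prints one plausible sequence of standard Docker CLI calls for it, as a plan to review rather than something that executes anything. The function name `print_rejoin_plan` is hypothetical, the manager address in the usage comment is a placeholder, and `<token>` stands for whatever `docker swarm join-token manager` prints.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) an assumed command sequence for removing a
# misbehaving manager from the swarm and joining it again as a manager.

# print_rejoin_plan NODE MANAGER_ADDR
#   NODE          name of the node to remove and rejoin
#   MANAGER_ADDR  host:port of a healthy manager
print_rejoin_plan() {
  local node=$1 manager_addr=$2
  cat <<EOF
# on a healthy manager
docker node demote ${node}
# on ${node}
docker swarm leave
# on a healthy manager
docker node rm ${node}
docker swarm join-token manager
# on ${node}, using the token printed by the previous command
docker swarm join --token <token> ${manager_addr}
# on a healthy manager
docker node ls
EOF
}

# Example (address is a placeholder):
#   print_rejoin_plan swarm-master-01 10.0.0.20:2377
```

Printing the plan instead of running it keeps the steps visible per host, since they have to be issued on different machines.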

thaJeztah commented 7 years ago

All nodes are running 1.12.3?

blockloop commented 7 years ago

Yes
