Docker swarm mesh networking not working

yangm97 commented 6 years ago

The only relevant log line I could find from the Docker daemon was:

Jan 17 20:37:33 scw-f118f1 dockerd[3495]: time="2018-01-17T20:37:33.047047682Z" level=error msg="Failed to deserialize netlink ndmsg: Link not found"

Steps to reproduce:

1. Init a swarm.
2. Join it from a Scaleway node.
3. Deploy something like:

   docker service create \
     --name my-web \
     --publish published=8080,target=80 \
     --replicas 1 \
     nginx

4. Try to curl the published port from a Scaleway host that doesn’t have a copy of the service running.
5. curl hangs, then reports that it has timed out.

Vad1mo commented 6 years ago

What kernel do you use? I had no problems running Swarm. Make sure you have the right kernel; you can search for kubernetes/scaleway. The same kernel recommendations apply to Swarm.

yangm97 commented 6 years ago

@Vad1mo I tried most of the available kernels, same results every time.

Vad1mo commented 6 years ago

What's actually not working? What is the error message?

I had Swarm running without any problems a few months ago.

yangm97 commented 6 years ago

https://docs.docker.com/engine/swarm/ingress/#publish-a-port-for-a-service

Suppose the Scaleway node is in the same situation as node3 in that example: it should forward requests to a node running the published service. Instead, communication between the nodes fails somewhere and the client connection times out.

Vad1mo commented 6 years ago

When setting up swarm mode you should use the private IP addresses, not the public ones.

yangm97 commented 6 years ago

You do need to use the public IP when your cluster spans more than one datacenter. In my case, I set up a swarm between Amazon, Azure and Scaleway instances. Communication between the Amazon and Azure nodes went just fine.

Vad1mo commented 6 years ago

Yes, because Scaleway public IPs are NATed to the private instance address. This is different from AWS/Azure. This might be the problem.

yangm97 commented 6 years ago

They all use NAT. And I disabled all three firewalls.

When running nmap, I found something interesting: while the Amazon and Azure nodes always kept the communication port open, the Scaleway node would open and close that port inconsistently. You might be able to get some traffic through for a few moments after restarting, but the port closes again soon after.
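For anyone repeating the nmap check: Docker's swarm documentation lists 2377/tcp (cluster management), 7946/tcp and 7946/udp (node-to-node gossip), and 4789/udp (overlay VXLAN traffic) as the ports that must be reachable between nodes. Below is a minimal sketch for probing the TCP ones from another node. It assumes bash and coreutils `timeout` are installed; the function names are hypothetical, and the UDP ports are listed but not probed.

```shell
# Ports a swarm node needs reachable from its peers (per Docker's swarm docs).
swarm_ports() {
  printf '%s\n' "2377/tcp" "7946/tcp" "7946/udp" "4789/udp"
}

# Return 0 if a TCP connection to host:port opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be available.
tcp_open() {
  local host=$1 port=$2
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Print "open" or "closed" for each TCP port in the list above.
# UDP entries are skipped because /dev/tcp cannot probe them reliably.
probe_node() {
  local host=$1 entry port
  for entry in $(swarm_ports); do
    [ "${entry#*/}" = "tcp" ] || continue
    port=${entry%/*}
    if tcp_open "$host" "$port"; then
      echo "$port/tcp open"
    else
      echo "$port/tcp closed"
    fi
  done
}
```

Run it as `probe_node <node-ip>` a few times over several minutes to see whether the result flips. The UDP ports can be scanned with `nmap -sU -p 7946,4789 <node-ip>`, keeping in mind that UDP scans often cannot distinguish open from filtered.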

Vad1mo commented 6 years ago

Do you see why the connection is closed? Any message or status code? Is it at the HTTP or TCP level?

yangm97 commented 6 years ago

I might have overlooked it, but the only message I saw from the Docker daemon that could be relevant was the single line in the original post. Now I regret not dumping the full log, as I have already torn down this setup (but I would bring it up again if needed).

doublemcz commented 6 years ago

+1. I am not even in any cloud: we have a hosted private cluster, and the nodes were joined over a private network (no NAT at all).

piec commented 6 years ago

Apparently the issue is the kernel configuration of the default images: modules that the overlay network needs are missing. The "Failed to deserialize netlink ndmsg: Link not found" message is unrelated. See:

r3pek commented 6 years ago

I'm having exactly the same problem: service ports are not published. Using the private IP also isn't ideal, since if you reboot your server you kill the swarm because the IP address gets changed. EDIT: (or poweroff / poweron)

r3pek commented 6 years ago

BTW, to fix the network issue I had to change the bootscript to the 4.15.* one. The one with "Docker" in the name didn't work.
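After switching bootscripts, one way to see whether the running kernel has the pieces swarm networking relies on is to scan its config for the relevant options. The sketch below is only illustrative: the option list is an assumption rather than something stated in this thread, and the function name is hypothetical. Docker also ships a more thorough `contrib/check-config.sh` script in its source tree.

```shell
# Kernel options this sketch looks for. The list is an assumption, not taken
# from this thread: VXLAN backs overlay networks, IP_VS backs the routing-mesh
# load balancer, BRIDGE and VETH are needed for container networking in general.
REQUIRED_OPTS="CONFIG_VXLAN CONFIG_IP_VS CONFIG_BRIDGE CONFIG_VETH"

# Print every option from REQUIRED_OPTS that is neither built in (=y)
# nor built as a module (=m) in the given kernel config file.
missing_kernel_options() {
  local config=$1 opt
  for opt in $REQUIRED_OPTS; do
    grep -Eq "^${opt}=(y|m)$" "$config" || echo "$opt"
  done
}
```

Typical usage would be `missing_kernel_options "/boot/config-$(uname -r)"`. If the kernel only exposes its config at `/proc/config.gz`, decompress it first with `zcat /proc/config.gz > /tmp/kconfig` and pass that file instead. Empty output means all listed options are present.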

Johannestegner commented 6 years ago

I tested all bootscripts available for the Paris start1-xs instances. None of them lets me use swarm. The node connects to the swarm, but none of its containers are reachable from any other node. I don't have this issue with any of my other swarm nodes.

Will this be fixed any time soon? Because without swarm, the server is sadly useless to me.

flolivaud commented 6 years ago

Same problem here. Changing the bootscript doesn't work. Does anyone have a fix?

lauevrar77 commented 5 years ago

I have the same issue with Docker installed as a one-click app, and previously with Docker installed on Debian. The Docker version is 18.09.

I have one manager and 3 workers. My manager and two of the workers are in par1 zone. The last worker is an OVH server.

I created the swarm using this command: docker swarm init --advertise-addr public_ip

Then all the workers joined the swarm with this command: docker swarm join --token SWMTKN-1-TOKEN public_ip:2377

It worked and all my nodes are detected. My services are also distributed to the nodes. No problem.

If the services are on the same node, they can communicate. But if they are on different nodes, they can't. When I connect into a container, I see that they can ping each other (using their service name or IP), but, for example, a mongo client cannot reach a mongo database (connection timeout). It's not just a problem with mongo, as redis has the same problem.

I have tested all bootscripts, but none of them makes it work.

However, it's worth noting that I get these log messages on the node that hosts the mongo/redis database:

time="2019-02-03T18:03:08.231472040Z" level=info msg="API listen on /var/run/docker.sock"
time="2019-02-03T18:03:11.121914063Z" level=info msg="Node 8cebd88d9847/51.75.124.205, joined gossip cluster"
time="2019-02-03T18:03:11.122878954Z" level=info msg="Node 8cebd88d9847/51.75.124.205, added to nodes list"
time="2019-02-03T18:03:11.123329059Z" level=info msg="Node a37309267714/10.17.85.145, joined gossip cluster"
time="2019-02-03T18:03:11.124267840Z" level=info msg="Node a37309267714/10.17.85.145, added to nodes list"
time="2019-02-03T18:03:23.121109458Z" level=warning msg="memberlist: Refuting a suspect message (from: 8cebd88d9847)"
time="2019-02-03T18:05:03.150819926Z" level=warning msg="memberlist: Refuting a suspect message (from: 8cebd88d9847)"
time="2019-02-03T18:05:19.120830135Z" level=warning msg="memberlist: Refuting a suspect message (from: 8cebd88d9847)"
time="2019-02-03T18:06:38.186586864Z" level=error msg="Bulk sync to node 8cebd88d9847 timed out"

I googled it but found no answer. Moreover, I don't understand exactly what "from: 8cebd88d9847" refers to in the log above.

Does anyone have a fix for this? Or an idea? As was said earlier, without swarm the server is useless to me. I have also already spent more than 20 hours searching for this error.

Here is the output of docker info:
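On the identifier itself: the same log contains "Node 8cebd88d9847/51.75.124.205, joined gossip cluster", so 8cebd88d9847 appears to be the gossip-layer name of the peer at 51.75.124.205, and the warnings are about messages received from that peer. A small sketch for listing every name/IP pair from a dockerd log follows. The function name is hypothetical, and the pattern assumes the exact message format shown in the excerpt above.

```shell
# Print "name ip" for every peer that joined the gossip cluster.
# Reads log lines from the files given as arguments, or from stdin if none.
# Only "joined gossip cluster" lines are matched, so each join is listed once.
gossip_nodes() {
  sed -n 's/.*msg="Node \([^/]*\)\/\([^,]*\), joined gossip cluster".*/\1 \2/p' "$@"
}
```

On a systemd host this could be fed with `journalctl -u docker.service | gossip_nodes`, assuming the daemon logs to the journal under that unit name. Matching the printed IP against your node list shows which machine the "suspect" and "Bulk sync ... timed out" messages refer to.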

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 8shuebl5wzitghr8244dpgo5y
 Is Manager: true
 ClusterID: s6rear2esk0swp9f1acbgg6vt
 Managers: 1
 Nodes: 4
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 1
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: public_ip.136
 Manager Addresses:
  public_ip:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.14.33-mainline-rev1
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.95GiB
Name: server-name
ID: XJ6Y:DQDH:NOQ5:IRTF:BFYM:JVNB:3764:WDWX:GIUN:WYKN:USZ5:ZXTI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

flolivaud commented 5 years ago

@lauevrar77 I never found a fix for this issue, and I think Scaleway doesn't actually care about Docker swarm; they are trying to promote Kubernetes instead.

luisfavila commented 5 years ago

Same here @lauevrar77. I moved to Hetzner cloud, everything working flawlessly.

wc-matteo commented 5 years ago

same here :'(

scaleway / kernel-tools

Docker swarm mesh networking not working #372