yangm97 opened 6 years ago
What kernel do you use? I had no problem running Swarm. Make sure you have the right kernel; you can google for kubernetes/scaleway. The same kernel recommendations apply to Swarm.
@Vad1mo I tried most of the available kernels, same results every time.
What's actually not working? What is the error message?
I had Swarm running without any problems a few months ago.
https://docs.docker.com/engine/swarm/ingress/#publish-a-port-for-a-service
Suppose the scaleway node is in the same situation as node3, it should forward the requests to a node running the published service. What happens is some error in the communication between nodes and the client connection times out.
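The routing-mesh behaviour described above is easy to reproduce with a throwaway published service (the service name, image and port below are arbitrary examples, not from the original report):

```shell
# Create a test service published on the ingress routing mesh.
# With a working mesh, port 8080 answers on EVERY node in the
# swarm, including nodes that run no replica of the service.
docker service create --name mesh-test \
    --publish published=8080,target=80 nginx:alpine

# From outside the cluster, each node's IP should now answer;
# on the broken Scaleway node this is where the timeout shows up.
curl -sSf http://<node-ip>:8080 >/dev/null && echo "mesh OK"

# Clean up
docker service rm mesh-test
```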
when setting up swarm mode you should use the private IP address not the public ones.
You do need to use the public IP when your cluster spans more than one datacenter. In my case, I set up a swarm between Amazon, Azure and Scaleway instances. Communication between the Amazon and Azure nodes went just fine.
Yes, because Scaleway public IPs are NATed to the private instance address. This is different from AWS/Azure and might be the problem.
They all use NAT. And I disabled all the three firewalls.
When running nmap, I found something interesting: while amazon and azure nodes would always keep the communication port open, the scaleway node would inconsistently open and close said port. Actually, you might be able to get some traffic a few moments after restarting, but the port will shut closed soon enough.
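The flapping-port observation can be reproduced by probing the ports Swarm needs between nodes: 2377/tcp for cluster management, 7946/tcp+udp for node gossip, and 4789/udp for the VXLAN overlay data plane. A rough sketch (the target host is a placeholder):

```shell
# TCP ports: cluster management and gossip
nmap -p 2377,7946 <scaleway-node-ip>

# UDP ports: gossip and VXLAN overlay traffic
# (UDP scans need root and are slower and less reliable than TCP scans)
sudo nmap -sU -p 7946,4789 <scaleway-node-ip>

# Repeat a few times: on the affected node the reported state
# flips between open and closed/filtered shortly after a restart.
```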
do you see why the connection is closed? Any message/status code? HTTP or TCP level?
I might have overlooked it, but the only message I saw in the docker daemon log that could be relevant was the single line in the OP. Now I regret not dumping the full log, as I already tore down this setup (but I would bring it up again if needed).
+1 I am not even in any cloud - we have a self-hosted private cluster and the nodes were joined over a private network (no NAT involved).
Apparently the issue is the kernel config of the default images: modules needed for the overlay network are missing. The "Failed to deserialize netlink ndmsg: Link not found" error is unrelated.
See:
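The missing-module theory above can be sanity-checked on the node itself. A minimal sketch, assuming a typical distribution kernel; the module list below covers the usual overlay-networking dependencies (VXLAN for the data plane, IPVS for service VIPs) and is illustrative, not exhaustive:

```shell
# Dry-run modprobe (-n) reports whether each module can be
# resolved by the running kernel without actually loading it.
for m in vxlan ip_vs xt_conntrack br_netfilter overlay; do
    if modprobe -n "$m" 2>/dev/null; then
        echo "$m: available"
    else
        echo "$m: MISSING"
    fi
done
```

A node whose kernel was built without one of these typically joins the swarm fine (that path only needs TCP) but cannot move overlay traffic.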
Having exactly the same problem: service ports are not published. Using the private IP also isn't ideal, since if you reboot your server you kill the swarm, because the IP address gets changed. EDIT: (or poweroff / poweron)
BTW, to fix the network issue I had to change the bootscript to the 4.15.* one. The one with "Docker" in the name didn't work.
Tested all bootscripts available for the paris start1-xs instances. None will allow me to use swarm. It connects to the swarm, but none of the containers are reachable from any other node. I don't have this issue with any other of my swarm nodes.
Will this be fixed anytime? Because without swarm, the server is sadly useless to me.
Same problem here. Trying to change the bootscript doesn't work. Does anyone have a fix?
I have the same issue with Docker installed as one click app and previously with Docker installed on Debian. The version of Docker is 18.09.
I have one manager and 3 workers. My manager and two of the workers are in par1 zone. The last worker is an OVH server.
I created the swarm using this command:
docker swarm init --advertise-addr public_ip
Then all the workers joined the swarm with this command:
docker swarm join --token SWMTKN-1-TOKEN public_ip:2377
It worked and all my nodes are detected. My services are also distributed to the nodes. No problem.
If the services are on the same node, they can communicate, but if they are on different nodes, they can't. When I connect into a container, I see that they can ping each other (using their service name or IP) but, as an example, a mongo client cannot access a mongo database (connection timeout). It's not just a problem with mongo, as redis has the same problem.
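That split symptom (DNS and ping work, TCP times out) can be isolated without mongo or redis by testing a plain TCP connection over an attachable overlay network between containers on two different nodes. A sketch, where the network name, image and port are arbitrary examples (busybox nc flags vary between alpine builds):

```shell
# Once, on a manager: create an attachable overlay network
docker network create --driver overlay --attachable probe-net

# On node A: a container listening on TCP 9000 over the overlay
docker run -d --name listener --network probe-net alpine \
    nc -l -p 9000

# On node B: name resolution and ping ride on the gossip plane
# and succeed, while the TCP connect is what hangs when VXLAN
# (udp/4789) traffic is being dropped between the nodes.
docker run --rm --network probe-net alpine \
    sh -c 'ping -c 1 listener; nc -w 5 listener 9000 </dev/null && echo "TCP OK"'
```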
I have tested all bootscripts, but none of them makes it work.
However, it's worth noting that I have these error logs on the node that has the mongo/redis database:
time="2019-02-03T18:03:08.231472040Z" level=info msg="API listen on /var/run/docker.sock"
time="2019-02-03T18:03:11.121914063Z" level=info msg="Node 8cebd88d9847/51.75.124.205, joined gossip cluster"
time="2019-02-03T18:03:11.122878954Z" level=info msg="Node 8cebd88d9847/51.75.124.205, added to nodes list"
time="2019-02-03T18:03:11.123329059Z" level=info msg="Node a37309267714/10.17.85.145, joined gossip cluster"
time="2019-02-03T18:03:11.124267840Z" level=info msg="Node a37309267714/10.17.85.145, added to nodes list"
time="2019-02-03T18:03:23.121109458Z" level=warning msg="memberlist: Refuting a suspect message (from: 8cebd88d9847)"
time="2019-02-03T18:05:03.150819926Z" level=warning msg="memberlist: Refuting a suspect message (from: 8cebd88d9847)"
time="2019-02-03T18:05:19.120830135Z" level=warning msg="memberlist: Refuting a suspect message (from: 8cebd88d9847)"
time="2019-02-03T18:06:38.186586864Z" level=error msg="Bulk sync to node 8cebd88d9847 timed out"
I googled about it but found no answer.
Moreover, I can't exactly understand what from: 8cebd88d9847 refers to in the logs above.
Does anyone have a fix for this? Or an idea? As I said earlier, without swarm the server is useless to me. I have also already searched for this error for more than 20 hours.
Here is the result of docker info:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 2
Server Version: 18.09.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: 8shuebl5wzitghr8244dpgo5y
Is Manager: true
ClusterID: s6rear2esk0swp9f1acbgg6vt
Managers: 1
Nodes: 4
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 1
Autolock Managers: false
Root Rotation In Progress: false
Node Address: public_ip.136
Manager Addresses:
public_ip:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.14.33-mainline-rev1
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.95GiB
Name: server-name
ID: XJ6Y:DQDH:NOQ5:IRTF:BFYM:JVNB:3764:WDWX:GIUN:WYKN:USZ5:ZXTI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
@lauevrar77 I never found any fix for this issue, and I think Scaleway actually doesn't care about Docker Swarm; they try to promote Kubernetes instead.
Same here @lauevrar77. I moved to Hetzner cloud, everything working flawlessly.
same here :'(
The only relevant log I could find on docker was
Jan 17 20:37:33 scw-f118f1 dockerd[3495]: time="2018-01-17T20:37:33.047047682Z" level=error msg="Failed to deserialize netlink ndmsg: Link not found"
Steps to reproduce: