moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

Docker swarm join-token empty #2651

Open burgoyn1 opened 6 years ago

burgoyn1 commented 6 years ago

I ran into a really weird issue last night with our docker swarm instance. One of our servers completely filled its disk with logs, which is what triggered the problem. Once that happened, the entire swarm stopped responding: workers and managers would no longer communicate with each other, and I kept getting the error that not enough managers were online. I checked that the managers were in fact all online and able to communicate, and they were. At that point I figured it would be best to just reset the quorum, so I ran docker swarm init --force-new-cluster --advertise-addr node01:2377 (with the proper credentials). It said it was successful and told me to promote other workers to managers, but it would not let me. When I ran docker swarm join-token worker, nothing was returned. It is as if the key used to connect all the servers to the swarm just disappeared. In the end I had to clear out the swarm completely and rebuild every single service. Has anyone experienced this before? I still have a few servers from the old swarm if more info is needed to help debug this.
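
For reference, here is the recovery path described above as it would normally be run on the surviving manager — a sketch only; node01 is the address from this report, and the join output shown is just the usual placeholder format, not real output:

```
# Re-initialize the swarm from this manager's local state.
docker swarm init --force-new-cluster --advertise-addr node01:2377

# This normally prints the full join command with a token, e.g.
#   docker swarm join --token SWMTKN-1-<token> node01:2377
# In the report above it printed nothing at all.
docker swarm join-token worker
```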

dperny commented 6 years ago

First off, it wasn't docker itself that filled up the logs, right?

Second, my understanding is that the issue is this:

  1. Only one of several manager nodes ran out of disk space.
  2. Quorum in the cluster was lost because not enough managers were online. Unexpected, because only one manager was malfunctioning (see the quorum note below).
  3. A new cluster was forced.
  4. No join token was produced.

Is this the correct sequence of events?
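
As background for item 2, a rough sketch of the raft quorum arithmetic plus the standard way to check which managers the cluster can still see (the manager count below is illustrative, not taken from this report):

```
# Swarm managers need a raft majority: with N managers, quorum = floor(N/2) + 1,
# so losing a single manager should only break quorum in a 1- or 2-manager cluster.
N=3
echo "quorum for $N managers: $(( N / 2 + 1 ))"   # -> 2

# On a reachable manager, list the manager nodes and their reachability:
docker node ls --filter "role=manager"
```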

If it's not likely to contain sensitive information, it would be helpful to get the docker daemon log output for the lines that have module= in them, which indicates that they're swarmkit logs.
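
Assuming dockerd runs under systemd and logs to the journal, one simple way to pull out just those lines is something like:

```
# Grab the swarmkit-related daemon log lines (the ones tagged with module=).
journalctl -u docker.service --no-pager | grep 'module='
```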

burgoyn1 commented 6 years ago

Yes, docker itself did not fill up the logs, but the issue came up when the server ran out of disk space. That is also the correct sequence of events, and the containers in the swarm all showed as Offline.

There are a lot of logs, but these are the ones which I believe you are looking for:

May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.438838561Z" level=info msg="Stopping manager" module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.438897453Z" level=info msg="shutting down certificate renewal routine" module=node/tls node.id=pxh1bm6nnfzq7ijfdc76jz4d5 node.role=swarm-manager
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.439483476Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(Dispatcher).Session" node.id=pxh1bm6nnfzq7ijfdc76jz4d5 node.session=qi6shhtlv2skmlxwzddq8adk6
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.440260536Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Internal desc = transport is closing"
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.441433672Z" level=info msg="Manager shut down" module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.442088111Z" level=info msg="Node ab9e3da49516/10.1.0.3, left gossip cluster"
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.442150307Z" level=info msg="Node ab9e3da49516 change state NodeActive --> NodeFailed"
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.442271082Z" level=info msg="Node ab9e3da49516/10.1.0.3, added to failed nodes list"
May 30 04:06:47 cersei-3 kernel: IPVS: __ip_vs_del_service: enter
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.449462829Z" level=info msg="Listening for connections" addr="[::]:2377" module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5 proto=tcp
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.449691254Z" level=info msg="Listening for local connections" addr=/var/run/docker/swarm/control.sock module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5 proto=unix
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.553591697Z" level=info msg="7590be75ae7e55ec became follower at term 2913" module=raft node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:47 cersei-3 dockerd[4658]: time="2018-05-30T04:06:47.553676299Z" level=info msg="newRaft 7590be75ae7e55ec [peers: [1e855842b436828,16cde2c02371a125,1a4d5973c5e2d058,2c9da0b7b93c5fd2,30e5d1a1784d2ff5,40b4ae441823ec37,7e4f6420fe54d975], term: 2913, commit: 1840979, applied: 1840000, lastindex: 1840979, lastterm: 2913]" module=raft node.id=pxh1bm6nn
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.050197438Z" level=info msg="7590be75ae7e55ec is starting a new election at term 2913" module=raft node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.050640787Z" level=info msg="7590be75ae7e55ec became candidate at term 2914" module=raft node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.050872095Z" level=info msg="7590be75ae7e55ec received MsgVoteResp from 7590be75ae7e55ec at term 2914" module=raft node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.051087321Z" level=info msg="7590be75ae7e55ec became leader at term 2914" module=raft node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.051293799Z" level=info msg="raft.node: 7590be75ae7e55ec elected leader 7590be75ae7e55ec at term 2914" module=raft node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.052175005Z" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.052225904Z" level=error msg="failed to create predefined network bridge" error="name conflicts with an existing object" module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.052310807Z" level=error msg="failed to create predefined network host" error="name conflicts with an existing object" module=node node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.138459843Z" level=error msg="failed allocating task z50tmwzkuyj0m3t4pduz9f284 during init" error="failed to allocate network IP for task z50tmwzkuyj0m3t4pduz9f284 network 3hmrfrizrkfzgnfkpxp6qqt91: could not allocate IP from IPAM: Address already in use" module=node node.id=pxh1bm6nnfzq7ijfdc76
May 30 04:06:48 cersei-3 dockerd[4658]: time="2018-05-30T04:06:48.171071192Z" level=error msg="failed allocating task z50tmwzkuyj0m3t4pduz9f284 during init" error="failed to allocate network IP for task z50tmwzkuyj0m3t4pduz9f284 network 3hmrfrizrkfzgnfkpxp6qqt91: could not allocate IP from IPAM: Address already in use" module=node node.id=pxh1bm6nnfzq7ijfdc76
May 30 04:06:49 cersei-3 dockerd[4658]: time="2018-05-30T04:06:49Z" level=info msg="shim reaped" id=dc4f4266dbdf13c2e6ec82bd8023b49b5ff459c07eb895e18fe5c8bf9aebf1b3 module="containerd/tasks"
May 30 04:06:49 cersei-3 dockerd[4658]: time="2018-05-30T04:06:49.520590420Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 30 04:06:49 cersei-3 kernel: docker0: port 2(veth750fee2) entered disabled state
May 30 04:06:49 cersei-3 kernel: veth73e4929: renamed from eth0
May 30 04:06:49 cersei-3 kernel: docker0: port 2(veth750fee2) entered disabled state
May 30 04:06:49 cersei-3 kernel: device veth750fee2 left promiscuous mode
May 30 04:06:49 cersei-3 kernel: docker0: port 2(veth750fee2) entered disabled state
May 30 04:06:49 cersei-3 kernel: br0: port 2(veth0) entered disabled state
May 30 04:06:49 cersei-3 kernel: br0: port 1(vxlan0) entered disabled state
May 30 04:06:49 cersei-3 kernel: ov-001001-3hmrf: renamed from br0
May 30 04:06:49 cersei-3 kernel: device veth0 left promiscuous mode
May 30 04:06:49 cersei-3 kernel: ov-001001-3hmrf: port 2(veth0) entered disabled state
May 30 04:06:49 cersei-3 kernel: device vxlan0 left promiscuous mode
May 30 04:06:49 cersei-3 kernel: ov-001001-3hmrf: port 1(vxlan0) entered disabled state
May 30 04:06:49 cersei-3 kernel: vx-001001-3hmrf: renamed from vxlan0
May 30 04:06:49 cersei-3 systemd-udevd[7514]: Could not generate persistent MAC address for vx-001001-3hmrf: No such file or directory
May 30 04:06:49 cersei-3 kernel: vethc63433f: renamed from veth0
May 30 04:06:49 cersei-3 systemd-udevd[7535]: Could not generate persistent MAC address for vethc63433f: No such file or directory
May 30 04:06:49 cersei-3 kernel: veth7c13a0c: renamed from eth1
May 30 04:06:50 cersei-3 dockerd[4658]: time="2018-05-30T04:06:50.622859291Z" level=warning msg="failed to deactivate service binding for container haproxy.pxh1bm6nnfzq7ijfdc76jz4d5.ftle13j0bm23utdzahakfsgfg" error="No such container: haproxy.pxh1bm6nnfzq7ijfdc76jz4d5.ftle13j0bm23utdzahakfsgfg" module=node/agent node.id=pxh1bm6nnfzq7ijfdc76jz4d5
May 30 04:06:52 cersei-3 dockerd[4658]: time="2018-05-30T04:06:52.624354279Z" level=warning msg="Peer operation failed:Unable to find the peerDB for nid:3hmrfrizrkfzgnfkpxp6qqt91 op:&{3 3hmrfrizrkfzgnfkpxp6qqt91 [] [] [] [] false false false DeleteNetwork}"
May 30 04:07:10 cersei-3 systemd-udevd[7675]: Could not generate persistent MAC address for veth1c22938: No such file or directory
May 30 04:07:10 cersei-3 kernel: device veth1c22938 entered promiscuous mode
May 30 04:07:10 cersei-3 kernel: IPv6: ADDRCONF(NETDEV_UP): veth1c22938: link is not ready
May 30 04:07:10 cersei-3 systemd-udevd[7673]: Could not generate persistent MAC address for veth6445744: No such file or directory
May 30 04:07:10 cersei-3 kernel: docker0: port 2(veth1c22938) entered disabled state
May 30 04:07:10 cersei-3 kernel: device veth1c22938 left promiscuous mode
May 30 04:07:10 cersei-3 kernel: docker0: port 2(veth1c22938) entered disabled state
May 30 04:07:10 cersei-3 dockerd[4658]: time="2018-05-30T04:07:10.179533211Z" level=error msg="49269cdcee79565de737a95bc860eb991e3d6fdd76df8189689e86ac130789b3 cleanup: failed to delete container from containerd: no such container"
May 30 04:07:10 cersei-3 systemd-udevd[7775]: Could not generate persistent MAC address for vethc49be85: No such file or directory
May 30 04:07:10 cersei-3 kernel: device vethc49be85 entered promiscuous mode
May 30 04:07:10 cersei-3 kernel: IPv6: ADDRCONF(NETDEV_UP): vethc49be85: link is not ready
May 30 04:07:10 cersei-3 kernel: docker0: port 2(vethc49be85) entered forwarding state
May 30 04:07:10 cersei-3 kernel: docker0: port 2(vethc49be85) entered forwarding state
May 30 04:07:10 cersei-3 systemd-udevd[7774]: Could not generate persistent MAC address for veth9bf500c: No such file or directory
May 30 04:07:10 cersei-3 kernel: IPVS: Creating netns size=2192 id=10
May 30 04:07:10 cersei-3 kernel: br0: renamed from ov-001001-3hmrf
May 30 04:07:10 cersei-3 systemd-udevd[7855]: Could not generate persistent MAC address for vx-001001-3hmrf: No such file or directory
May 30 04:07:10 cersei-3 kernel: vxlan0: renamed from vx-001001-3hmrf
May 30 04:07:10 cersei-3 kernel: device vxlan0 entered promiscuous mode
May 30 04:07:10 cersei-3 kernel: br0: port 1(vxlan0) entered forwarding state
May 30 04:07:10 cersei-3 kernel: br0: port 1(vxlan0) entered forwarding state
May 30 04:07:10 cersei-3 systemd-udevd[7887]: Could not generate persistent MAC address for veth1e799b1: No such file or directory
May 30 04:07:10 cersei-3 kernel: veth0: renamed from veth523bd6b
May 30 04:07:10 cersei-3 kernel: device veth0 entered promiscuous mode
May 30 04:07:10 cersei-3 kernel: br0: port 2(veth0) entered forwarding state
May 30 04:07:10 cersei-3 kernel: br0: port 2(veth0) entered forwarding state
May 30 04:07:10 cersei-3 dockerd[4658]: time="2018-05-30T04:07:10Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/49269cdcee79565de737a95bc860eb991e3d6fdd76df8189689e86ac130789b3/shim.sock" debug=false module="containerd/tasks" pid=7938
May 30 04:07:10 cersei-3 kernel: IPVS: Creating netns size=2192 id=11
May 30 04:07:10 cersei-3 kernel: eth0: renamed from veth9bf500c
May 30 04:07:10 cersei-3 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc49be85: link becomes ready
May 30 04:07:10 cersei-3 kernel: br0: port 2(veth0) entered disabled state
May 30 04:07:10 cersei-3 kernel: eth1: renamed from veth1e799b1
May 30 04:07:10 cersei-3 kernel: br0: port 2(veth0) entered forwarding state
May 30 04:07:10 cersei-3 kernel: br0: port 2(veth0) entered forwarding state
May 30 04:07:25 cersei-3 kernel: docker0: port 2(vethc49be85) entered forwarding state
May 30 04:07:25 cersei-3 systemd[1]: Starting Cleanup of Temporary Directories...
May 30 04:07:25 cersei-3 systemd-tmpfiles[8030]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
May 30 04:07:25 cersei-3 systemd[1]: Started Cleanup of Temporary Directories.
May 30 04:07:25 cersei-3 kernel: br0: port 1(vxlan0) entered forwarding state
May 30 04:07:25 cersei-3 kernel: br0: port 2(veth0) entered forwarding state

haproxy is the name of our HAProxy container.

burgoyn1 commented 6 years ago

Anyone have an update on this?

johnjelinek commented 6 years ago

I'm seeing this issue too.

mateodelnorte commented 5 years ago

I believe we're seeing this issue as well.

swift1911 commented 5 years ago

Anyone have an update on this?

mxkellner commented 1 year ago

Same problem for me: I forced a new cluster and didn't receive a token after executing docker swarm join-token worker. After much trying, this is what did work for me:

chrisbecke commented 1 year ago

I experienced exactly the same issue. A swarm lost quorum on AWS because an auto scaling group was a bit too ambitious. I was left with one nominally functioning manager that showed all the desired swarm state (services and configs), but it would not produce a join token, either directly or after a --force-new-cluster.

Even after successfully executing --force-new-cluster, docker info showed "IsManager: false", but commands like node ls and service ls still succeeded.
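
For anyone landing here, a hedged sketch of the checks involved — these are standard docker CLI calls, not a confirmed fix from this thread, and the token rotation at the end is only a suggestion:

```
# Report the local swarm state and whether this node believes it is a manager.
docker info --format 'state={{.Swarm.LocalNodeState}} manager={{.Swarm.ControlAvailable}}'

# These reportedly still succeed even though the node claims it is not a manager:
docker node ls
docker service ls

# The worker token prints empty in the reports above; rotating it forces a new
# token to be generated, so it is one thing to try:
docker swarm join-token worker
docker swarm join-token --rotate worker
```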