davidgibbons opened 6 years ago
I just want to check that I've found the same issue described here. The sequence in my case: after removing /var/lib/docker/swarm on the promoted node, starting Docker and joining it back to the swarm as a worker does not result in any issues. Is this the issue you are also seeing? If so, I am hitting it on docker-18.09.1-ce.
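For reference, the cycle-a-node-back-in-as-a-worker path described above can be sketched roughly as follows. This is a sketch, not an exact repro: the join token and manager address are placeholders, and the systemctl unit name assumes a systemd-based host.

```shell
# On the node being cycled back in as a worker (placeholders, not real values):
systemctl stop docker              # stop the daemon before touching swarm state
rm -rf /var/lib/docker/swarm       # wipe the node's local swarm/raft state
systemctl start docker

# Re-join as a worker. <worker-token> comes from `docker swarm join-token worker`
# run on a healthy manager, and <manager-ip> is any reachable manager.
docker swarm join --token <worker-token> <manager-ip>:2377
```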
Docker versions: managers 18.04.0-ce, workers 17.12.0-ce
My clusters have been around for a few months. I was trying to rebuild the cluster to get them on consistent and updated versions. In the process of that I created a new autoscale group and connected new 18.04.0-ce managers to my existing cluster.
During the process every new node would fail to join with context deadline exceeded.
Normally we run with 5 manager nodes. During the process I attempted to scale down to a single manager, removing the previous nodes with docker node rm.
At this point adding any additional nodes would continue to fail: Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
At some point during this I tried to force a new cluster with docker swarm init --force-new-cluster. That worked until another manager attempted to join; the cluster would then report that there were not enough managers and drop back offline.
This process repeated 2 or 3 times.
docker swarm init --force-new-cluster
Swarm initialized: current node (sdgd5lpphdzmjrlnl5cpkhihm) is now a manager.
docker node ls
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
I finally found this docker-for-aws thread and ran the suggested adjustment to the raft snapshot interval: https://github.com/docker/for-aws/issues/81
That appeared to settle things out; adding new nodes afterwards was fine and functional.
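For anyone who lands here: the adjustment in that thread concerns the swarm raft snapshot interval, which is settable via docker swarm update. The specific value to use is discussed in the linked issue; 30000 below is only an illustrative value, not the one suggested there.

```shell
# Raise the raft snapshot interval (number of log entries between snapshots).
# The default is 10000; 30000 is only an illustrative value here -- see the
# linked docker/for-aws issue for the value that was actually suggested.
docker swarm update --snapshot-interval 30000

# Verify the new setting (run on a manager):
docker info --format '{{.Swarm.Cluster.Spec.Raft.SnapshotInterval}}'
```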
I'll be doing our production environment in two weeks, and I can update this further if the issue arises again.
CC @johnharris85