moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

Problems with raft and quorum after connecting new managers #2670

Open davidgibbons opened 6 years ago

davidgibbons commented 6 years ago

Docker versions: managers 18.04.0-ce, workers 17.12.0-ce

My clusters have been around for a few months. I was trying to rebuild this cluster to get the nodes onto consistent, up-to-date versions. In the process I created a new autoscale group and joined new 18.04.0-ce managers to the existing cluster.

During the process, every new node would fail to join with "context deadline exceeded".

Normally we run with 5 manager nodes. During the rebuild I attempted to scale down to a single manager node, removing the previous managers with docker node rm.
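
For reference, a minimal sketch of that kind of scale-down (node names are placeholders; demoting before removing is the documented way to take a manager out of the Raft quorum, though I can't say for certain every node went through both steps here):

```sh
# Demote each extra manager before removing it, so it leaves the Raft quorum cleanly.
# mgr-2 .. mgr-5 are placeholder node names.
docker node demote mgr-2
docker node rm mgr-2
# ...repeat for mgr-3, mgr-4, mgr-5 until a single manager remains.
```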

At this point adding any additional nodes would continue to fail: Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

At some point during this I tried to force a new cluster with docker swarm init --force-new-cluster. That worked until another manager attempted to join; at that point it would report that there were not enough managers and the cluster would drop back offline.

This process repeated 2 or 3 times.

docker swarm init --force-new-cluster

Swarm initialized: current node (sdgd5lpphdzmjrlnl5cpkhihm) is now a manager.

docker node ls

Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
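
For anyone debugging the same state: the local daemon's view of the swarm can still be read even when there is no leader, using standard docker info template fields (nothing specific to this issue):

```sh
# Quick check of the local swarm state and manager/node counts as this daemon sees them.
docker info --format 'state={{.Swarm.LocalNodeState}} managers={{.Swarm.Managers}} nodes={{.Swarm.Nodes}}'
```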

I finally found this docker-for-aws thread and applied the suggested adjustment to the snapshot interval: https://github.com/docker/for-aws/issues/81
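
The change amounts to tuning the swarm's Raft snapshot settings. A sketch of that kind of adjustment (the value below is only an illustration, not necessarily the one from that thread; 10000 is Docker's default):

```sh
# Change how many Raft log entries accumulate between snapshots.
docker swarm update --snapshot-interval 5000

# On a manager, the current Raft settings (including Snapshot Interval) show up in:
docker info
```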

That appeared to settle things down; adding new nodes afterwards was fine and functional.

At this point I'm doing our production environment in two weeks; I can update this further if the issue arises again.

CC @johnharris85

mcassaniti commented 5 years ago

I just want to check that the issue I'm seeing is the same as this one. The sequence of events is:

  1. The swarm originally had more than one manager, but has since been shrunk to a single manager. There are currently no other managers in the swarm. The manager works correctly with no errors in its logs, workers can talk to it without issue, and there are no errors in the worker logs about talking to the manager.
  2. When a worker node is promoted to a manager, it immediately attempts to connect to the old, removed managers. This results in quorum loss.
  3. Stopping the recently promoted node and forcing the original manager to form a new cluster is the only way to bring quorum back.
  4. Once quorum is restored the cluster operates correctly. Removing the contents of /var/lib/docker/swarm on the promoted node, starting Docker, and joining it back to the swarm as a worker does not cause any further issues (a sketch of that reset follows this list).
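
To make step 4 concrete, roughly what that reset looks like (systemd is assumed; the token and manager address are placeholders you'd fill in from your own cluster):

```sh
# On the node that was previously promoted and then stopped:
systemctl stop docker
rm -rf /var/lib/docker/swarm    # discard only the stale swarm/Raft state
systemctl start docker

# Get a fresh worker token on the remaining manager (docker swarm join-token worker),
# then re-join this node as a worker:
docker swarm join --token <worker-token> <manager-ip>:2377
```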

Is this the issue you are also seeing? If so, I'm hitting it on docker-18.09.1-ce.