moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Better failure mode for marooned managers #29367

Open friism opened 7 years ago

friism commented 7 years ago

A manager that's stuck outside of quorum doesn't fail in a good way. For a test, I started a 3-manager swarm and then terminated two of the managers. The 3rd manager now believes itself to be outside the quorum.

What I expected

Running commands that hit the Swarm API should result in a prompt error saying something like "This node is outside the Swarm quorum", maybe with a recommendation to use one of the managers that are supposed to still be in quorum (those could perhaps be listed) and alternatively suggesting rebuilding quorum from this manager (if that's possible).

What I got

Swarm-related commands (e.g. docker node ls) hang or time out.

Additional info

~ $ docker info
Containers: 5
 Running: 4
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 1.13.0-rc3
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: h3tx9ctw18xgbj1u1ogx6ba5l
 Error: rpc error: code = 4 desc = context deadline exceeded
 Is Manager: true
 ClusterID:
 Managers: 0
 Nodes: 0
 Orchestration:
  Task History Retention Limit: 0
 Raft:
  Snapshot Interval: 0
  Heartbeat Tick: 0
  Election Tick: 0
 Dispatcher:
  Heartbeat Period: Less than a second
 CA Configuration:
  Expiry Duration: Less than a second
 Node Address: 172.31.7.108
 Manager Addresses:
  172.31.20.235:2377
  172.31.20.236:2377
  172.31.7.108:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 51371867a01c467f08af739783b8beafc154c4d7
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.8.12-moby
Operating System: Alpine Linux v3.4
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.67 GiB
Name: ip-172-31-7-108.us-west-2.compute.internal
ID: O7M6:AEHD:XE3J:BF3H:M4SI:CWYD:FEVN:5RCH:MUKI:HBYD:ZLVS:BU5L
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 61
 Goroutines: 115
 System Time: 2016-12-13T20:07:24.406563198Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

allencloud commented 7 years ago

Oh, I found another problem in the output above:

CA Configuration:
  Expiry Duration: Less than a second

I mean that Expiry Duration should be at least one hour, but here it shows "Less than a second". Am I wrong? @aaronlehmann

aaronlehmann commented 7 years ago

Running commands that hit the Swarm API should result in a prompt error saying something like "This node is outside the Swarm quorum", maybe with a recommendation to use one of the managers that are supposed to still be in quorum (those could perhaps be listed) and alternatively suggesting rebuilding quorum from this manager (if that's possible).

The way it works right now is that we wait up to a certain amount of time for a response from the leader, or for a leader to emerge if there is none. I think this general approach is sound. We don't want random commands to fail if there happens to be a leader election going on at that moment. But I think the error message could be much better. Instead of a generic timeout, we could explain that this node isn't able to reach the leader, or that it wasn't able to elect a leader. Listing the other manager addresses is a good idea.

cc @LK4D4

Oh, I found another problem in the output above:

Yes, it looks like a bug that these values are printed even though they are unknown. That should probably be filed as a separate bug.

friism commented 7 years ago

@allencloud do you want to file an issue for that?

allencloud commented 7 years ago

OK, I will do that as soon as I can reproduce this. Thanks a lot @friism

allencloud commented 7 years ago

I reproduced this issue, though some more information may be needed:

  1. My Docker version is 1.12.5.
  2. When the Swarm has only 3 managers, I could not reproduce this issue by making 2 non-leader managers leave. I could only reproduce it by making the leader and one non-leader manager leave.

aaronlehmann commented 7 years ago

See also #29987 - it would be great to improve our error messages to make it clearer why particular commands time out when quorum is lost.