Closed: apophizzz closed this issue 7 years ago
/cc @aaronlehmann
I'm not sure if this is a bug at all, or if re-initializing the Raft cluster simply needs a majority of available managers (2 in this case) to return to an operating state.
In general the Raft cluster needs a majority of managers to be available, otherwise it can't service most requests. Even though you're still able to run `docker node ls` after taking down all but one manager, trying to do something that changes the state, such as creating or updating a service, would fail.
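The majority rule can be made concrete with a little arithmetic (my own illustration, not a Docker command): a Raft cluster of N managers needs floor(N/2) + 1 of them reachable, and can tolerate losing the rest.

```shell
# Raft quorum arithmetic: N managers need floor(N/2)+1 available
# to form a majority; the remainder can fail without losing quorum.
for n in 1 2 3 4 5 6 7; do
  majority=$(( n / 2 + 1 ))
  tolerated=$(( n - majority ))
  echo "managers=$n majority=$majority tolerated_failures=$tolerated"
done
```

For the 3-manager setup discussed in this issue the majority is 2, which is why a single surviving manager can no longer accept state changes.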
There's more information about this at the following links:
https://docs.docker.com/engine/swarm/raft/
https://docs.docker.com/engine/swarm/admin_guide/
I'll close this issue, since the behavior seems to be expected. But feel free to follow up with questions.
@aaronlehmann thanks, what you explained makes perfect sense to me. Changing the cluster state always needs the consensus of a manager majority.
But I'm not sure I understand yet why I need a second manager node to come back online before the first manager can report the global cluster state (i.e. recognize that all the other nodes are down). I'm thinking of a scenario where every machine belonging to the cluster has been shut down beforehand.
Am I right that every manager node has a local copy of the entire cluster state rather than just a partial view of the cluster? If that assumption is true (otherwise feel free to correct me), can you think of a reason why `docker node ls` can't be satisfied from the local cluster store? Is that a Raft bootstrapping thing?
Thanks for your patience :)
Am I right that every manager node has a local copy of the entire cluster state rather than just a partial view of the cluster? If that assumption is true (otherwise feel free to correct me), can you think of a reason why `docker node ls` can't be satisfied from the local cluster store? Is that a Raft bootstrapping thing?
In order to make sure that a query like this returns the most up-to-date information, we internally route the queries to the Raft leader. The leader is the node that's currently coordinating the writes, so it's impossible for it to have out-of-date information.
I think the case where you're seeing `docker node ls` work with only one manager is one where that node still thinks it's the leader, so it's willing to respond to that query. But in the case where it doesn't work, the node doesn't know who the leader is, so it doesn't know where to send the query.
I believe that in the upcoming 17.04 release, nodes will automatically relinquish the leadership position if they notice that they aren't in contact with enough nodes to maintain a quorum. This will make the behavior more consistent.
Admittedly our error message about "context deadline exceeded" is unhelpful here. We've been meaning to improve these errors (cc @dperny).
Thanks for your patience :)
It's no problem at all.
Great, now I actually understand what was going on.
We've been meaning to improve these errors
I share your opinion that a significantly more meaningful message would be great here; the current one can be a little confusing. So I'm done with this issue, thanks for participating.
Hi everybody, I use Docker 1.12.6 and I think I have the same problem. I get no answer from the `docker ps` command or the `docker-compose ps` command. In the log I have:

Mar 27 15:37:40 test-ran.wanesy.fr dockerd[1059]: time="2017-03-27T15:37:40.796732144+02:00" level=warning msg="Health check error: rpc error: code = 4 desc = context deadline exceeded"
@lwalid your swarm doesn't have a quorum. As a temporary solution to this issue you can use:

docker swarm init --force-new-cluster --advertise-addr <addr>:<port> --listen-addr <addr>:<port>

This will recreate the cluster from the node's current state and update the managers/workers balance.
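That recovery sequence can be sketched as a script (my own illustration; the `<addr>:<port>` values are placeholders, and the DRY_RUN guard only prints the docker commands unless you set DRY_RUN=0 on the surviving manager):

```shell
# Sketch: rebuild a swarm that has lost quorum, run on the last healthy manager.
# DRY_RUN=1 (the default) only prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  echo "+ $*"
  [ "$DRY_RUN" = "1" ] || "$@"
}

# Recreate a single-manager Raft cluster from this node's local state:
run docker swarm init --force-new-cluster \
    --advertise-addr "<addr>:<port>" --listen-addr "<addr>:<port>"

# The former managers must then re-join with a fresh manager token:
run docker swarm join-token manager
```

Once the new single-node cluster is up, have the other managers join again so the swarm regains a proper majority.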
Description
I've got a test environment for Docker Swarm Mode locally on my machine, conforming to the setup of this Docker Lab. The setup is made up of 3 manager as well as 3 worker nodes. When trying to restart the existing swarm with all machines in a stopped state, `docker node ls` runs into a timeout when I start by bringing the leading manager node (let's call it manager1) back to life via Docker Machine. The error message I get is:

Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

The other nodes' machines are still disabled at this point. The interesting thing is that as soon as I start another one of the remaining manager nodes (manager2 or manager3) and repeat the command on manager1 (the leader) afterwards, it works as expected and gives me output like this:

Even if I now decide to shut down manager2 again, everything keeps working fine. I'm not sure if this is a bug at all, or if re-initializing the Raft cluster simply needs a majority of available managers (2 in this case) to return to an operating state. Moreover, I made an interesting observation examining the output of `docker info` before and after restarting a second manager. With only manager1 running and getting the described error message, the Raft section of the `docker info` output says:

Along with the error message, that makes total sense. After having started manager2 (error vanished):
Steps to reproduce the issue:

1. Stop all machines with `docker-machine stop manager1` (repeat for all machines) without explicitly leaving the existing swarm.
2. Run `docker-machine start manager1`.
3. Run `docker node ls`. You should get the error message I described.
4. Start a second manager and run `docker node ls` again. Alternatively, you can also execute this command on the follower node, which should not make any difference.
5. Shut down the second manager and run `docker node ls` again on manager1. Again, you should get a list of nodes, showing that only manager1 is currently ready and reachable.

Expected results: Instead of facing an error message after only the leading manager node has come back online, I'd expect the same output concerning the cluster state that I get after starting another manager node, i.e.:
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
Cheers, Patrick