Open friism opened 7 years ago
Oh, I found another problem in the display above:
CA Configuration:
Expiry Duration: Less than a second
As I understand it, Expiry Duration should be at least one hour, but here it shows Less than a second. Am I wrong? @aaronlehmann
Running commands that hit the Swarm API should result in a prompt error saying something like "This node is outside the Swarm quorum", maybe with a recommendation to use one of the managers that are supposed to still be in quorum (those could perhaps be listed) and alternatively suggesting rebuilding quorum from this manager (if that's possible).
The way it works right now is that we wait up to a certain amount of time for a response from the leader, or for a leader to emerge if there is none. I think this general approach is sound. We don't want random commands to fail if there happens to be a leader election going on at that moment. But I think the error message could be much better. Instead of a generic timeout, we could explain that this node isn't able to reach the leader, or that it wasn't able to elect a leader. Listing the other manager addresses is a good idea.
cc @LK4D4
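To make the suggestion above concrete, here is a minimal Go sketch of the "wait for a leader, then explain the failure" idea. It is not SwarmKit's actual code: `waitForLeader`, `describeQuorumError`, the `leaderProbe` type, the two sentinel errors, and the manager addresses are all hypothetical names made up for illustration. It only shows how a generic timeout could be turned into a message that says whether the leader was unreachable or no leader could be elected, and that lists other known managers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// Hypothetical sentinel errors; these are not real SwarmKit identifiers.
var (
	ErrNoLeader          = errors.New("no leader elected")
	ErrLeaderUnreachable = errors.New("leader unreachable")
)

// leaderProbe is a hypothetical callback that returns nil once a leader
// has answered, or one of the sentinel errors above otherwise.
type leaderProbe func(ctx context.Context) error

// waitForLeader retries probe every interval until it succeeds or the
// timeout expires, so a brief leader election does not fail the command.
// On timeout it returns the last error the probe reported.
func waitForLeader(ctx context.Context, timeout, interval time.Duration, probe leaderProbe) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		last := probe(ctx)
		if last == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return last
		case <-ticker.C:
			// Try again on the next tick.
		}
	}
}

// describeQuorumError turns the failure into a message that explains the
// cause and lists other managers the user could try instead.
func describeQuorumError(err error, managers []string) string {
	var reason string
	switch {
	case errors.Is(err, ErrNoLeader):
		reason = "this node was not able to elect a leader"
	case errors.Is(err, ErrLeaderUnreachable):
		reason = "this node is not able to reach the leader"
	default:
		reason = "the request timed out"
	}
	msg := "swarm command failed: " + reason
	if len(managers) > 0 {
		msg += "; other known managers: " + strings.Join(managers, ", ")
	}
	return msg
}

func main() {
	// Simulate a manager that has lost quorum: the probe never finds a leader.
	probe := func(ctx context.Context) error { return ErrNoLeader }

	err := waitForLeader(context.Background(), 50*time.Millisecond, 10*time.Millisecond, probe)
	if err != nil {
		// The addresses below are made-up examples.
		fmt.Println(describeQuorumError(err, []string{"10.0.0.2:2377", "10.0.0.3:2377"}))
	}
}
```

The retry loop keeps the existing behaviour (commands survive a short election), while the final message replaces the generic timeout with the cause and the list of other managers.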
Oh, I found another problem in the display above:
Yes, it looks like a bug that these values are printed even though they are unknown. That should probably be filed as a separate bug.
@allencloud do you want to file an issue for that?
OK, I will do that as soon as I can reproduce this. Thanks a lot @friism
I reproduced this issue, though some more information may be needed.
See also #29987 - it would be great to improve our error messages to make it clearer why particular commands time out when quorum is lost.
A manager that's stuck outside of quorum doesn't fail in a good way. As a test, I started a 3-manager swarm and then terminated two managers. The third manager now believes itself to be outside the quorum.
What I expected
Running commands that hit the Swarm API should result in a prompt error saying something like "This node is outside the Swarm quorum", maybe with a recommendation to use one of the managers that are supposed to still be in quorum (those could perhaps be listed) and alternatively suggesting rebuilding quorum from this manager (if that's possible).
What I got
Swarm-related commands (e.g. docker node ls) hang or time out.
Additional info