In Vault, we've encountered an issue where losing a voter, then a non-voter in the same redundancy zone, in short succession, would lead to the cluster becoming unhealthy (losing an active node).
In the scenario above, the following would happen before the change in this PR (write-up stolen from @banks):
When a voter fails, it is “demoted”, but what this actually means is that Autopilot will first promote the follower (with a Raft reconfiguration).
Some time later, once the new voter is healthy, Autopilot will remove the old voter and actually “demote” it to non-voter.
But if the new voter also fails before the first one is demoted, we now have 4 voters in the Raft configuration (so a majority requires 3) but only 2 available servers, and so the leader is forced to step down as it no longer has a majority of voters.
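The arithmetic behind that failure can be sketched as follows. This is only an illustration, not Autopilot code: `quorum`, `hasQuorum`, and the step table are hypothetical names, and the starting point of 3 voters is an assumption inferred from the "4 voters" figure above.

```go
package main

import "fmt"

// quorum returns the number of voters that must be reachable for a
// Raft leader to hold a strict majority of the voter set.
func quorum(voters int) int {
	return voters/2 + 1
}

// hasQuorum reports whether the healthy voters still form a majority.
func hasQuorum(voters, healthy int) bool {
	return healthy >= quorum(voters)
}

func main() {
	// Assumed sequence for a cluster that starts with 3 voters.
	steps := []struct {
		desc    string
		voters  int // voters in the Raft configuration
		healthy int // voters that are actually reachable
	}{
		{"initial state", 3, 3},
		{"a voter fails", 3, 2},
		{"non-voter promoted, failed voter not yet removed", 4, 3},
		{"newly promoted voter fails too", 4, 2},
	}
	for _, s := range steps {
		fmt.Printf("%-50s voters=%d healthy=%d quorum=%d ok=%v\n",
			s.desc, s.voters, s.healthy, quorum(s.voters), hasQuorum(s.voters, s.healthy))
	}
}
```

Only the last step loses the majority: the failed voter that was never removed still counts towards the size of the voter set, which is why the leader has to step down.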