Open jumaffre opened 3 years ago
This short workshop paper might be useful: https://dl.acm.org/doi/pdf/10.1145/3447851.3458739
I agree that these might be worth considering. Essentially, liveness in Raft depends on establishing a stable leader who is able to make progress, who is not unduly forced to step down, and who can receive and reply to client requests.
Omission faults present 4 main issues for Raft:
As we discuss in the blog post, you do have to be careful when modifying Raft for omission faults not to introduce new liveness issues elsewhere. PreVote and CheckQuorum are definitely worth considering, though they don't always cover the case where a link between nodes might drop arbitrary messages (a flaky link). For simplicity, the blog post assumes either an ideal link between nodes or no link at all.
As discussed with @heidihoward, some points to take into consideration when implementing the CheckQuorum extension: the primary should step down if it hasn't heard from a quorum (`f`) of backup nodes for the duration of the election timeout (which is not randomised, unlike on the backup nodes). Note that the stepping-down primary should become follower in the same term as the one it was previously primary in (i.e. no increase in term), while keeping track of the votes it has already cast in this term (i.e. for itself).

Also to explore: to reduce service unavailability, it may be nice for the primary node to tell the other nodes when it steps down, so that they don't have to wait for their election timeout to expire before trying to elect a new primary node.
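The step-down behaviour described above could be sketched roughly as follows. This is an illustrative Python sketch, not CCF's actual implementation; `RaftNode`, `Role`, and all field names are assumptions:

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    PRIMARY = "primary"

class RaftNode:
    def __init__(self, node_id, term):
        self.node_id = node_id
        self.term = term
        self.role = Role.PRIMARY
        self.voted_for = node_id  # voted for itself when winning this term

    def step_down(self):
        """CheckQuorum step-down: become follower in the *same* term."""
        assert self.role == Role.PRIMARY
        self.role = Role.FOLLOWER
        # Intentionally no `self.term += 1` and no reset of `self.voted_for`:
        # the node must remember it already voted (for itself) in this term,
        # so it cannot vote again for someone else in the same term.
```

The key design point is what the method does *not* do: neither the term nor the vote record is touched, which is what prevents a stepped-down primary from double-voting in its old term.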
Leader stickiness is interesting independently from PreVote, because it gives some defense against another node's clock running fast.
Adding the `tla+` tag as it might be worth spec'ing this change out before implementing it.
Our current implementation of CFT consensus still suffers from some limitations and liveness issues around omission faults (i.e. dropped messages). This is especially the case if one or several nodes are partitioned out of the service, as demonstrated by the end-to-end tests introduced in #2553: a single partitioned backup will automatically become candidate if it was partitioned for >= `election_timeout`.

Note that this is only true when no new write transactions are processed by the current leader. Otherwise, the partitioned node wouldn't be able to win an election as its last known `seqno` would be behind.

The following two extensions should help mitigate this family of issues:
1. PreVote
Each potential candidate should first check that a quorum of nodes would accept this node as the primary should it become candidate. Only then should the node transition to a candidate state and request votes from other nodes.
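A PreVote round as described could look roughly like this. A minimal sketch, assuming a quorum check on the prospective term and log position; the function and dictionary keys (`term`, `last_seqno`) are hypothetical, not CCF's API:

```python
def run_pre_vote(node, peers, quorum_size):
    """Probe peers with the prospective term (current term + 1) without
    persisting it; only a quorum of positive replies lets the node
    become a real candidate."""
    prospective_term = node["term"] + 1
    granted = 1  # the prospective candidate counts itself
    for peer in peers:
        newer_term = prospective_term > peer["term"]
        up_to_date = node["last_seqno"] >= peer["last_seqno"]
        if newer_term and up_to_date:
            granted += 1  # peers record nothing durably for a PreVote
    return granted >= quorum_size
```

Because `prospective_term` is only probed and never written back, a failed PreVote round leaves no trace: no term is bumped and no vote is consumed, which is exactly what keeps a partitioned node from disrupting a healthy service when it rejoins.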
Other nodes should respond to `PreVote` messages as if it was a real election, but don't need to keep track of which nodes they have granted their `PreVote` to. It is only when a quorum of nodes have responded positively to the `PreVote` round that the node can become candidate.

2. Leader stickiness/CheckQuorum
The goal here is to make sure that a primary stays primary for as long as possible, i.e. doesn't step down just because a single node started an election.
Nodes should grant their `PreVote`/`Vote`s only if they haven't heard from a primary within their election timeout. As I understand it, this implies that a node should only grant `PreVote`/`Vote`s when it is already in the new "campaign" (is this a good name for it?) or candidate state.

Moreover, a primary should actively step down (i.e. become a follower in the same term of its primary-ness) if it hasn't heard `AppendEntries` responses from a majority of backups within the election timeout.

Note that this also impacts the "sunny day" election scenario, as the first half of the nodes whose election timeout expires wouldn't manage to get a quorum of votes (because the other nodes still know about the current leader and haven't yet timed out). This is also a positive change, as it would basically average out the election timeout of the service over a quorum of nodes rather than have it set by the single node with the smallest election timeout.
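The two conditions above (sticky vote granting on backups, CheckQuorum on the primary) could be sketched as two predicates. This is an illustrative sketch with hypothetical names; timestamps are plain floats for simplicity:

```python
def should_grant_vote(now, last_heard_from_primary, election_timeout):
    # Leader stickiness: only grant a PreVote/Vote if no primary has
    # been heard from within this node's own election timeout.
    return (now - last_heard_from_primary) >= election_timeout

def primary_should_step_down(now, last_ack_time, election_timeout, quorum_size):
    # CheckQuorum: step down if AppendEntries responses from a quorum
    # of nodes (the primary counts itself) haven't arrived in time.
    live = 1 + sum(1 for t in last_ack_time.values()
                   if now - t < election_timeout)
    return live < quorum_size
```

The "averaging" effect mentioned above falls out of `should_grant_vote`: a candidate only wins once a quorum of nodes have *individually* stopped hearing from the primary, so the effective failover delay is governed by a quorum of timeouts rather than the single smallest one.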
Sources: