microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0

Raft extensions for omission faults #2577

Open jumaffre opened 3 years ago

jumaffre commented 3 years ago

Our current implementation of CFT consensus still suffers from some limitations and liveness issues around omission faults (i.e. dropped messages). This is especially the case when one or more nodes are partitioned from the rest of the service, as demonstrated by the end-to-end tests introduced in #2553: a single partitioned backup will automatically become a candidate if it has been partitioned for >= election_timeout.

Note that this is only true when no new write transactions are processed by the current leader. Otherwise, the partitioned node wouldn't be able to win an election as its last known seqno would be behind.

The following two extensions should help mitigate this family of issues:

1. PreVote

Each potential candidate should first check that a quorum of nodes would accept it as the primary should it become a candidate. Only then should the node transition to the candidate state and request votes from other nodes.

Other nodes should respond to PreVote messages as if they were part of a real election, but they don't need to keep track of which nodes they have granted their PreVote to. Only when a quorum of nodes has responded positively to the PreVote round can the node become a candidate.
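To make the PreVote round concrete, here is a minimal sketch in Python. It is not CCF's implementation: `PreVoteRequest`, `Node`, `grants_pre_vote` and `may_become_candidate` are hypothetical names, and the up-to-date check is the standard Raft comparison on (last term, last seqno), which is an assumption about how the check would be done here.

```python
from dataclasses import dataclass


@dataclass
class PreVoteRequest:
    term: int            # term the sender would campaign in (its current term + 1)
    last_log_term: int   # term of the sender's last log entry
    last_log_seqno: int  # seqno of the sender's last log entry


@dataclass
class Node:
    current_term: int
    last_log_term: int
    last_log_seqno: int

    def grants_pre_vote(self, req: PreVoteRequest) -> bool:
        """Answer as in a real election, but record nothing:
        no vote is remembered and the local term is not changed."""
        if req.term < self.current_term:
            return False
        # Assumed check: the sender's log must be at least as up to date as ours.
        return (req.last_log_term, req.last_log_seqno) >= (
            self.last_log_term,
            self.last_log_seqno,
        )


def may_become_candidate(node: Node, reachable_peers: list[Node], cluster_size: int) -> bool:
    """Run one PreVote round; the node counts its own PreVote."""
    req = PreVoteRequest(node.current_term + 1, node.last_log_term, node.last_log_seqno)
    granted = 1 + sum(peer.grants_pre_vote(req) for peer in reachable_peers)
    return granted >= cluster_size // 2 + 1
```

In this sketch a fully partitioned backup calls `may_become_candidate(node, [], 3)`, gets `False`, and therefore never increments its term or disturbs the current primary.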

2. Leader stickiness/CheckQuorum

The goal here is to make sure that a primary stays primary for as long as possible, i.e. doesn't step down just because a single node started an election.

Nodes should only grant their PreVote/Votes if they haven't heard from a primary within their election timeout. As I understand it, this implies that a node should only grant PreVote/Votes when it is already in the new "campaign" (is this a good name for it?) or candidate state.

Moreover, a primary should actively step down (i.e. become a follower in the same term in which it was primary) if it hasn't received AppendEntries responses from a majority of backups within the election timeout.
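The two rules above could be sketched as follows. This is an illustration under assumptions, not CCF code: `StickyNode` and its fields are hypothetical, time is passed in explicitly as seconds, and the primary is assumed to count itself towards a quorum of the whole cluster (the exact threshold is a design choice the description above leaves open).

```python
from dataclasses import dataclass, field


@dataclass
class StickyNode:
    election_timeout: float                # seconds
    is_primary: bool = False
    last_heard_from_primary: float = 0.0   # backup: time of last AppendEntries received
    # primary: backup id -> time of last AppendEntries response received
    last_ack: dict[str, float] = field(default_factory=dict)

    def may_grant_vote(self, now: float) -> bool:
        """Leader stickiness (backup side): refuse PreVotes/Votes while
        a primary has been heard from within the election timeout."""
        return now - self.last_heard_from_primary >= self.election_timeout

    def check_quorum(self, now: float, cluster_size: int) -> bool:
        """CheckQuorum (primary side): True if enough backups responded recently."""
        recent = sum(1 for t in self.last_ack.values() if now - t < self.election_timeout)
        # Assumption: the primary counts itself towards the quorum.
        return recent + 1 >= cluster_size // 2 + 1

    def tick(self, now: float, cluster_size: int) -> None:
        """Periodic check: step down to follower, keeping the same term."""
        if self.is_primary and not self.check_quorum(now, cluster_size):
            self.is_primary = False
```

A primary that is itself partitioned stops receiving responses, fails `check_quorum` after one election timeout, and steps down instead of continuing to accept requests it cannot commit.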

Note that this also impacts the "sunny day" election scenario, as the first half of the nodes whose election timeout expires wouldn't manage to get a quorum of votes (because the remaining nodes still know about the current leader and haven't yet timed out). This is also a positive change, as it would basically average out the election timeout of the service over a quorum of nodes rather than have it set by the single node with the smallest election timeout.
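One simplified way to read this effect: if all backups lose the primary at the same instant, and a node only grants its vote once its own timeout has expired, then an election cannot succeed before a quorum of nodes (the candidate included) have timed out. The sketch below computes that earliest time. The function name and the example timeouts are made up for illustration, and message delays and candidate retries are ignored.

```python
def earliest_viable_election(backup_timeouts: list[float], cluster_size: int) -> float:
    """Earliest time at which a quorum of nodes have timed out,
    i.e. the quorum-th smallest election timeout among the backups."""
    quorum = cluster_size // 2 + 1
    if len(backup_timeouts) < quorum:
        raise ValueError("not enough surviving backups to form a quorum")
    return sorted(backup_timeouts)[quorum - 1]


# 5-node service whose primary has just failed; timeouts in seconds.
timeouts = [2.5, 1.0, 2.1, 1.4]
without_stickiness = min(timeouts)                        # 1.0
with_stickiness = earliest_viable_election(timeouts, 5)   # 2.1
```

Under these assumptions the service-wide timeout moves from the minimum to the quorum-th smallest value, so a single node with an unusually short timeout no longer determines when elections happen.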

Sources:

heidihoward commented 2 years ago

This short workshop paper might be useful: https://dl.acm.org/doi/pdf/10.1145/3447851.3458739

heidihoward commented 2 years ago

I agree that these might be worth considering. Essentially, liveness in Raft depends on establishing a stable leader who is able to make progress, who is not unduly forced to step down, and who can receive and reply to client requests.

Omission faults present 4 main issues for Raft:

As we discuss in the blog post, you do have to be careful when modifying Raft for omission faults not to introduce new liveness issues elsewhere. PreVote and CheckQuorum are definitely worth considering, though they don't always cover the case where a link between nodes might drop arbitrary messages (a flaky link). For simplicity, the blog post assumes either an ideal link between nodes or no link at all.

jumaffre commented 2 years ago

As discussed with @heidihoward, some points to take into consideration when implementing the CheckQuorum extension:

Also to explore: to reduce service unavailability, it may be nice for the primary node to tell the other nodes when it steps down, so that they don't have to wait for their election timeout to expire before trying to elect a new primary node.

eddyashton commented 1 year ago

Leader stickiness is interesting independently from PreVote, because it gives some defense against another node's clock running fast.

heidihoward commented 1 year ago

Adding tla+ tag as it might be worth spec'ing this change out before implementing it