With additional logging, it looks like the problem is a term history mismatch.
On node 11:
2021-02-10T15:05:19.634167Z -0.003 0 [info ] ../src/consensus/aft/impl/state.h:38 | Updating view to: 17 at version: 151
2021-02-10T15:05:19.634223Z -0.003 0 [info ] ../src/consensus/aft/impl/state.h:55 | Resulting views: 1, 1, 22, 33, 44, 54, 65, 77, 88, 99, 110, 121, 130, 135, 140, 147, 151
Whereas on node 12:
2021-02-10T15:05:20.616682Z -0.008 0 [info ] ../src/consensus/aft/impl/state.h:38 | Updating view to: 17 at version: 153
2021-02-10T15:05:20.616739Z -0.008 0 [info ] ../src/consensus/aft/impl/state.h:55 | Resulting views: 1, 1, 22, 33, 44, 54, 65, 77, 88, 99, 110, 121, 130, 135, 140, 147, 153
And as a result, when node 12 received an append entries from 11:
2021-02-10T15:05:20.616925Z -0.008 0 [info ] ../src/consensus/aft/raft.h:1016 | Received pt: 17 pi: 151 t: 17 i: 151 toi: 17
2021-02-10T15:05:20.616984Z -0.008 0 [info ] ../src/consensus/aft/raft.h:1100 | Previous term for 151 should be 16
2021-02-10T15:05:20.617085Z -0.008 0 [info ] ../src/consensus/aft/raft.h:1122 | Recv append entries to 12 from 11 but our log at 151 has the wrong previous term (ours: 16, theirs: 17)
I'm now trying to trace exactly how each node has advanced its term history.
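For anyone else reading these logs: a minimal sketch of how a "Resulting views" list of this shape can be turned into "term at index". This is an assumption based on the log lines above (entry v-1 of the list holds the first index seen in view v), not a description of the real ViewHistory implementation.

```cpp
// Minimal sketch, assuming entry v-1 of the "Resulting views" list holds the
// first index that appeared in view v. Not the actual CCF ViewHistory code.
#include <cstdint>
#include <vector>

using Index = uint64_t;
using View = uint64_t;

View view_at(const std::vector<Index>& first_index_in_view, Index idx)
{
  // The view of idx is the latest view whose first index is <= idx.
  for (size_t v = first_index_in_view.size(); v > 0; --v)
  {
    if (first_index_in_view[v - 1] <= idx)
    {
      return static_cast<View>(v);
    }
  }
  return 0; // idx precedes all recorded views
}

// Node 11's history ends "... 140, 147, 151", so view_at(history_11, 151) == 17,
// while node 12's ends "... 140, 147, 153", so view_at(history_12, 151) == 16:
// exactly the "ours: 16, theirs: 17" disagreement in the log above.
```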
Looks like it is related to snapshots, as I've tried increasing the snapshot threshold from 20KB to 200MB, and it no longer occurs (still confirming this with a bunch more runs). Losing focus with these logs for the day, but something along the lines of "the snapshot means that the primary maintains a last-globally-committ[ed|able] value that is later than it should be, so they don't roll back far enough on elections"? Will investigate further tomorrow.
I have a more minimal/quicker local repro where the first phase of the test (adding a node and retiring the primary) is skipped, which means there are only 3 nodes during the lifetime of the service.
The issue seems to be that the new primary Node 1 in view 13 cannot make progress in a two-node network (previous primary Node 2 is still suspended). I'll try to explain the scenario below:

Node 2 is suspended, and Node 1 becomes primary:

10:34:38.803 | INFO | infra.node:suspend:351 - Node 2 suspended...
10:34:38.803 | INFO | infra.network:wait_for_new_primary:816 - Waiting up to 16.0s for a new primary (different from 2) to be elected...
10:34:50.643 | WARNING | infra.network:find_primary:715 - Could not successfully connect to node 2. Retrying...
10:34:50.927 | INFO | infra.network:wait_for_new_primary:827 - [1||] GET /node/network
10:34:50.927 | INFO | infra.network:wait_for_new_primary:827 - 200 @13.50 {"current_view":13,"primary_id":1,"service_status":"OPEN","view_change_in_progress":false}
Note that it took 4 tries to elect a new leader in view 13:

Node 0 became candidate in view 10; Node 1 became follower in view 10 but responded false to the vote request because of the committable checks. (Should Node 1 become follower if the consistency checks fail?)

Node 1 became candidate in view 11; Node 0 became follower in view 11 and responded true to the vote request, so Node 1 became leader in view 11. In view 11, Node 1 sends append entries to Node 0, but these are rejected because of term consistency checks:

2021-02-11T10:34:42.992144Z -0.003 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 9 pi: 49 t: 11 i: 49 toi: 9
2021-02-11T10:34:42.992180Z -0.003 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 49 should be -9223372036854775808
2021-02-11T10:34:42.992210Z -0.004 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 49 has the wrong previous term (ours: -9223372036854775808, theirs: 9)
// Node 0 hasn't got idx 49 so request 46+ instead
2021-02-11T10:34:42.995441Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 9 pi: 45 t: 11 i: 49 toi: 9
2021-02-11T10:34:42.995480Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 45 should be 8
2021-02-11T10:34:42.995511Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 45 has the wrong previous term (ours: 8, theirs: 9)
Node 0 eventually times out and becomes candidate in view 12; Node 1 becomes follower but responds false to the vote request because of the committable checks.

Node 1 became candidate in view 13; Node 0 became follower in view 13 and responded true to the vote request, so Node 1 became leader in view 13.

Even with Node 1 as the new leader after Node 2 has been suspended, the election bounce-offs are not yet finished: Node 0 still rejects append entries from the new leader Node 1 because of a term mismatch:

2021-02-11T10:34:49.643714Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 11 pi: 50 t: 13 i: 50 toi: 11
2021-02-11T10:34:49.643766Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 50 should be -9223372036854775808
2021-02-11T10:34:49.643819Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 50 has the wrong previous term (ours: -9223372036854775808, theirs: 11)
// Node 0 hasn't got idx 50 so request 46+ instead
2021-02-11T10:34:49.652434Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 9 pi: 45 t: 13 i: 50 toi: 11
2021-02-11T10:34:49.652507Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 45 should be 8
2021-02-11T10:34:49.652555Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 45 has the wrong previous term (ours: 8, theirs: 9)
2021-02-11T10:34:49.652636Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1442 | Send append entries response from 0 to 1 for index 45: 1
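To make the rejections above easier to follow, here's a rough sketch of the previous-entry consistency check a follower applies to the prev_idx/prev_term pair carried by an append entries. This is not the CCF implementation; the sentinel constant, helper and result names are assumptions for illustration.

```cpp
// Hypothetical sketch of the previous-entry check seen in the logs above.
#include <cstdint>
#include <limits>

using Index = uint64_t;
using Term = int64_t;

// INT64_MIN is what appears in the logs ("-9223372036854775808") when the
// follower has no entry at the requested index.
constexpr Term TERM_UNKNOWN = std::numeric_limits<Term>::min();

enum class AppendResult
{
  Ok,           // previous entry matches: append the batch
  MissingIndex, // we don't hold prev_idx: ask for an earlier batch instead
  TermMismatch  // we hold prev_idx, but in a different term: reject
};

AppendResult check_prev_entry(
  Term (*term_of)(Index), // assumed helper: local term at idx, or TERM_UNKNOWN
  Index prev_idx,
  Term prev_term)
{
  const Term ours = term_of(prev_idx);

  if (ours == TERM_UNKNOWN)
  {
    // e.g. Node 0 has nothing at idx 49 above, so it asks for 46+ instead.
    return AppendResult::MissingIndex;
  }

  if (ours != prev_term)
  {
    // e.g. "our log at 45 has the wrong previous term (ours: 8, theirs: 9)"
    return AppendResult::TermMismatch;
  }

  return AppendResult::Ok;
}
```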
--
As @eddyashton pointed out, it seems that this is a view consistency failure on the remaining nodes (Node 0 and Node 1). Investigating where the inconsistency comes from now...
We now have consistent repros, a good idea of what's broken, and we're narrowing in on a potential fix.
The problem is in how we handle rollback, vs how Raft handles overwriting of inconsistent logs:
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction
If you receive an append entries, you should apply it to your log, no questions asked (or blow up if you think something is actually broken). Our rollback behaviour tries to make this trivial - you've already rolled back to what they might send you. But actually we don't roll back far enough: specifically, we only roll back to our last committable index. This is safe and necessary when we first become a follower.
But it is valid for a primary to be deposed, and for its committable (but not committed) state to be lost by the service. When this old primary receives an append entries from the new primary, even though it voted against them, it should accept the new term and roll back to the append entries' index. The safety check here should be that we don't roll back committed state (in fact we should probably enforce this inside rollback, regardless of the calling reason), but it is safe to roll back committable state, which we currently don't do.
In summary: in become_follower we do rollback(last_committable_index()). This is necessary for the case where this is called from recv_request_vote_response. But we also need to call something like rollback(idx) in recv_append_entries, for when we've lost an election and need to remove our uncommitted suffix.
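A minimal sketch of the shape of that change, under simplified assumed names (ReplicaState, last_idx, commit_idx) rather than the actual CCF types:

```cpp
// Hypothetical sketch, not the actual CCF implementation. Types and member
// names are assumptions used for illustration only.
#include <algorithm>
#include <cstdint>

using Index = uint64_t;

struct AppendEntries
{
  Index prev_idx; // index immediately before the entries in this message
  Index idx;      // last index contained in this message
};

struct ReplicaState
{
  Index last_idx = 0;   // latest entry we hold (possibly uncommitted)
  Index commit_idx = 0; // latest entry we know is committed

  // Drop everything after idx; committed state is never dropped, so the
  // safety check lives inside rollback itself, whatever the caller.
  void rollback(Index idx)
  {
    last_idx = std::max(idx, commit_idx);
  }

  void recv_append_entries(const AppendEntries& r)
  {
    // ... term and sender checks elided ...

    // If we hold a suffix beyond the point the new primary is sending
    // from, we lost the election and that suffix must be removed.
    if (r.prev_idx < last_idx)
    {
      rollback(r.prev_idx);
    }

    // ... deserialise and append the entries from the message ...
  }
};
```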
@eddyashton that makes sense to me, and I think the fix you suggest in recv_append_entries is correct too, with a minor tweak: upon receiving an append entries for a higher term, I think we may want to roll back to min(our local opinion of the commit index, the new primary's commit index) rather than to the index of the received append entry, and only apply it if they're contiguous.
Actually I don't think we want to roll back blindly like this unless the terms differ, because we could lose persistence of the committable but not yet committed suffix. I think we want to keep all indices for which the term is compatible with the statement made by the latest append entry.
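A rough sketch of how those two suggestions combine, with invented names (sender_commit_idx, suffix_terms_conflict) standing in for whatever the real code would use:

```cpp
// Hypothetical sketch only: decide how far to roll back when an append
// entries arrives for a higher term.
#include <algorithm>
#include <cstdint>
#include <optional>

using Index = uint64_t;

// Returns the index to roll back to, or nullopt if nothing should be dropped.
std::optional<Index> rollback_target(
  Index local_commit_idx,
  Index sender_commit_idx,
  bool suffix_terms_conflict) // does our suffix disagree with the sender's terms?
{
  if (!suffix_terms_conflict)
  {
    // Our committable-but-uncommitted suffix is still consistent with the
    // statement made by the latest append entries: keep it.
    return std::nullopt;
  }

  // Otherwise be conservative: keep nothing beyond what is known to be
  // committed, either by us or by the new primary.
  return std::min(local_commit_idx, sender_commit_idx);
}
```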
Some quick notes from a long discussion about this:
Tracking when we need to roll back is non-trivial. It seems it's not possible, with the current state, to make a local determination when receiving an append entries of which entries are matching ones from terms you agree on and which should overwrite your suffix (i.e. trigger a rollback).
The index that we ACK when responding to append entries is also 'too high'. Either we need to track more cautiously what is safe to ACK (we currently optimistically report the last thing we know about), or we need to have done the rollback before we build this response.
The difficulty is in the general case. An append entries can say "Here are the entries from 3.15 to 4.50, and I'm in term 4". I may have correct values from term 3 up to @3.25, but also have a suffix up to @3.40 which is no longer valid (which the new primary did not have). We currently detect term changes within this batch (i.e. the fact that term 4 starts at @4.26) by replaying it and processing its signature entries. But we also skip replaying any entries we already have. We want the receiver to decide "roll back to 25, replay the batch from 26, and ack up to 50". This may need separate watermarks ("what's the absolute latest entry I have" vs "what's the latest thing I'm sure is correct" vs "what's the latest thing I've acked to the primary"), or we may need to be more pessimistic and roll back further.
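One possible shape for those separate watermarks, purely as a sketch (the names are invented, this is not the CCF data structure):

```cpp
// Hypothetical sketch of the three watermarks discussed above.
#include <cstdint>

using Index = uint64_t;

struct LogWatermarks
{
  Index last_received = 0; // absolute latest entry held, may be a dead suffix
  Index last_verified = 0; // latest entry we are sure the primary also has
  Index last_acked = 0;    // latest entry we have acknowledged to the primary

  // Invariant we'd want to maintain:
  //   last_acked <= last_verified <= last_received
};
```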
Resolved in #2192. Summarising the problem again for posterity, with a hopefully elucidating diagram:
a b c d
1 | | | | | | | | | | | X
2 | | | | | | | O
3 | | | | | O
4 | | O
5 | | O
1 is the primary. Primary 1 has last told everyone that commit was at a. O marks the last committable index each node has received (for simplicity, all of the | transactions are committable as well). All of these Os have been ACKd, and the primary has received all of these ACKs. Given the current ACKs, the primary knows commit has advanced to b, and can start advertising this to clients.
If primary 1 dies here, only node 2 or 3 can win the subsequent election, since the votes of 4 and 5 will get either of them over the majority (but neither 2 nor 3 would vote for 4 or 5).
If 3 becomes a candidate, it proposes that it will start its new term from b. 2 will vote against this (since 2's own log extends beyond b), but 3 will win regardless. This is correct - to win an election, 3 must have the index that 1 knew was committed, even if 3 doesn't know it was committed.
So the transactions at c and c-1 currently held by 2 are part of a dead suffix, along with everything between c and d - not enough nodes knew about them before the primary died, and they weren't saved by the elected node, so they are lost.

When node 2 receives a message from 3 saying "I will be giving you indices from b" (i.e. an append_entries with prev_idx == b), 2 must roll back to b to continue the persisted ledger.
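Reading the diagram's columns as concrete indices (an assumption for illustration: a = 3, b = 6, c = 8, d = 12) and reusing the hypothetical ReplicaState/AppendEntries sketch from earlier, node 2's situation looks like this:

```cpp
// Worked example with assumed concrete indices for the diagram above.
void diagram_example()
{
  ReplicaState node2;
  node2.last_idx = 8;   // c: node 2's last committable entry
  node2.commit_idx = 3; // a: the last commit point advertised by primary 1

  // The new primary, 3, won the election with a log ending at b = 6, so its
  // first append entries after the election carries prev_idx == 6.
  AppendEntries ae;
  ae.prev_idx = 6; // b
  ae.idx = 7;      // the first entry 3 produces in its new term

  // On receipt, node 2 must drop its dead suffix {7, 8} - rolling back to 6,
  // which is still above its commit index of 3 - before appending 3's
  // entries. This is exactly the rollback recv_append_entries was missing.
  node2.recv_append_entries(ae);
  // node2.last_idx is now 6.
}
```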
A followup from this is described in #2197; we should retain these dead forks on disk if we can, rather than aggressively truncating them.
rotation_test has been failing regularly in the CI, example runs here and here. Primarily, but not exclusively, in the NoSGX_Instrumented build, suggesting it is a timing issue. The failure appears to be an election deadlock; the surviving nodes get into a cycle of elections, and although they pass, we continue to start subsequent elections. We disabled the test here to avoid blocking PRs with unrelated CI failures.
I've finally obtained a repro on my local machine, so here's a representative log walkthrough. Just looking at the logs and pointing out anything odd for now, will clarify this with the code later.
Infra
We've just unsuspended 11, and waited until we think the network has healed (node 12 says commit has advanced to 17.139, committing the write we just did @17.138). Then we suspend node 12, wait for a new primary to be elected, and check that we can write to this node. This write (@21.140) never commits - global commit remains at 17.137 on node 10.

The output from the 3 involved nodes: node10.out.log node11.out.log node12.out.log
Node 12
Node 12 shows roughly what we expect - it becomes leader in term 17:
Then its last log line is at 09:56:20; it gets suspended shortly after. There are only a few suspicious lines:
Are we doing expensive snapshot work that is blocking Raft liveness?
11 is still appending from an old term. This instance seems harmless - 11 was the leader and they correct themselves quickly afterwards.
Node 11
Node 11 becomes a follower in 17, but with slightly odd surroundings:
I guess the request vote response is from a previous term, or perhaps even the election which made us leader? We should log the term of this response, and make it look less like an error if it's a late (but valid) vote for us in an election we just won. I don't know why we're voting against 12 immediately after becoming a follower. Again, we should log the term this vote is about.

After that 12 goes to sleep, so 11 starts a new election as candidate:
But node 10 votes against for some reason:
We repeat this for several terms. We briefly become follower in 21, but this times out just like 17 did:
This results in a very suspicious blob of late appends, which for some reason have reached us after we gave up on this term:
Node 10
Node 10 is the log I have the most questions about.
Why does it continually become follower but vote against?
Are we queuing comms to 11 because it is somehow still unknown to us?
General thoughts
I think I understand the initial instability - node 10 has the latest committed state (@17.138), which 12 wrote but 11 has not yet seen. But 11 keeps calling elections first (due to lucky timeouts), which it can never win. This is a known Raft pathology - you're going to have unsuccessful elections ~50% of the time while 12 is dead, but eventually 10 should win. That explains some election instability, but not why 10 is then deposed from term 21.