With additional logging, it looks like the problem is a term history mismatch.
On node 11:
2021-02-10T15:05:19.634167Z -0.003 0 [info ] ../src/consensus/aft/impl/state.h:38 | Updating view to: 17 at version: 151
2021-02-10T15:05:19.634223Z -0.003 0 [info ] ../src/consensus/aft/impl/state.h:55 | Resulting views: 1, 1, 22, 33, 44, 54, 65, 77, 88, 99, 110, 121, 130, 135, 140, 147, 151
Whereas on node 12:
2021-02-10T15:05:20.616682Z -0.008 0 [info ] ../src/consensus/aft/impl/state.h:38 | Updating view to: 17 at version: 153
2021-02-10T15:05:20.616739Z -0.008 0 [info ] ../src/consensus/aft/impl/state.h:55 | Resulting views: 1, 1, 22, 33, 44, 54, 65, 77, 88, 99, 110, 121, 130, 135, 140, 147, 153
And as a result, when node 12 received an append entries from 11:
2021-02-10T15:05:20.616925Z -0.008 0 [info ] ../src/consensus/aft/raft.h:1016 | Received pt: 17 pi: 151 t: 17 i: 151 toi: 17
2021-02-10T15:05:20.616984Z -0.008 0 [info ] ../src/consensus/aft/raft.h:1100 | Previous term for 151 should be 16
2021-02-10T15:05:20.617085Z -0.008 0 [info ] ../src/consensus/aft/raft.h:1122 | Recv append entries to 12 from 11 but our log at 151 has the wrong previous term (ours: 16, theirs: 17)
I'm now trying to trace exactly how each node has advanced its term history.
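For anyone else reading these logs: a minimal sketch of how a "Resulting views" list of this shape can be turned into "term at index". This is an assumption based on the log lines above (entry v-1 of the list holds the first index seen in view v), not a description of the real ViewHistory implementation.

```cpp
// Minimal sketch, assuming entry v-1 of the "Resulting views" list holds the
// first index that appeared in view v. Not the actual CCF ViewHistory code.
#include <cstdint>
#include <vector>

using Index = uint64_t;
using View = uint64_t;

View view_at(const std::vector<Index>& first_index_in_view, Index idx)
{
  // The view of idx is the latest view whose first index is <= idx.
  for (size_t v = first_index_in_view.size(); v > 0; --v)
  {
    if (first_index_in_view[v - 1] <= idx)
    {
      return static_cast<View>(v);
    }
  }
  return 0; // idx precedes all recorded views
}

// Node 11's history ends "... 140, 147, 151", so view_at(history_11, 151) == 17,
// while node 12's ends "... 140, 147, 153", so view_at(history_12, 151) == 16:
// exactly the "ours: 16, theirs: 17" disagreement in the log above.
```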
Looks like it is related to snapshots, as I've tried increasing the snapshot threshold from 20KB to 200MB, and it no longer occurs (still confirming this with a bunch more runs). Losing focus with these logs for the day, but something along the lines of "the snapshot means that the primary maintains a last-globally-committ[ed|able] value that is later than it should be, so they don't roll back far enough on elections"? Will investigate further tomorrow.
I have a more minimal/quicker local repro where the first phase of the test (adding a node and retiring the primary) is skipped, which means there are only 3 nodes during the lifetime of the service.
The issue seems to be that the new primary Node 1 in view 13 cannot make progress in a two-node network (previous primary Node 2 is still suspended). I'll try to explain the scenario below:

Node 2 is suspended, and Node 1 becomes primary:

10:34:38.803 | INFO | infra.node:suspend:351 - Node 2 suspended...
10:34:38.803 | INFO | infra.network:wait_for_new_primary:816 - Waiting up to 16.0s for a new primary (different from 2) to be elected...
10:34:50.643 | WARNING | infra.network:find_primary:715 - Could not successfully connect to node 2. Retrying...
10:34:50.927 | INFO | infra.network:wait_for_new_primary:827 - [1||] GET /node/network
10:34:50.927 | INFO | infra.network:wait_for_new_primary:827 - 200 @13.50 {"current_view":13,"primary_id":1,"service_status":"OPEN","view_change_in_progress":false}
Note that it took 4 tries to elect a new leader in view 13:

Node 0 became candidate in view 10; Node 1 became follower in view 10 but responded false to the vote request because of the committable checks. (Should Node 1 become follower if the consistency checks fail?)

Node 1 became candidate in view 11; Node 0 became follower in view 11 and responded true to the vote request, so Node 1 became leader in view 11. In view 11, Node 1 sends append entries to Node 0, but these are rejected because of term consistency checks:

2021-02-11T10:34:42.992144Z -0.003 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 9 pi: 49 t: 11 i: 49 toi: 9
2021-02-11T10:34:42.992180Z -0.003 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 49 should be -9223372036854775808
2021-02-11T10:34:42.992210Z -0.004 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 49 has the wrong previous term (ours: -9223372036854775808, theirs: 9)
// Node 0 hasn't got idx 49 so request 46+ instead
2021-02-11T10:34:42.995441Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 9 pi: 45 t: 11 i: 49 toi: 9
2021-02-11T10:34:42.995480Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 45 should be 8
2021-02-11T10:34:42.995511Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 45 has the wrong previous term (ours: 8, theirs: 9)
Node 0 eventually times out and becomes candidate in view 12; Node 1 becomes follower but responds false to the vote request because of the committable checks.

Node 1 became candidate in view 13; Node 0 became follower in view 13 and responded true to the vote request, so Node 1 became leader in view 13.

Even with Node 1 as the new leader after Node 2 has been suspended, the election bounce-offs are not yet finished: Node 0 still rejects append entries from the new leader Node 1 because of a term mismatch:

2021-02-11T10:34:49.643714Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 11 pi: 50 t: 13 i: 50 toi: 11
2021-02-11T10:34:49.643766Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 50 should be -9223372036854775808
2021-02-11T10:34:49.643819Z -0.007 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 50 has the wrong previous term (ours: -9223372036854775808, theirs: 11)
// Node 0 hasn't got idx 50 so request 46+ instead
2021-02-11T10:34:49.652434Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1016 | Received pt: 9 pi: 45 t: 13 i: 50 toi: 11
2021-02-11T10:34:49.652507Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1100 | Previous term for 45 should be 8
2021-02-11T10:34:49.652555Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1122 | Recv append entries to 0 from 1 but our log at 45 has the wrong previous term (ours: 8, theirs: 9)
2021-02-11T10:34:49.652636Z -0.008 0 [debug] ../src/consensus/aft/raft.h:1442 | Send append entries response from 0 to 1 for index 45: 1
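To make the rejections above easier to follow, here's a rough sketch of the previous-entry consistency check a follower applies to the prev_idx/prev_term pair carried by an append entries. This is not the CCF implementation; the sentinel constant, helper and result names are assumptions for illustration.

```cpp
// Hypothetical sketch of the previous-entry check seen in the logs above.
#include <cstdint>
#include <limits>

using Index = uint64_t;
using Term = int64_t;

// INT64_MIN is what appears in the logs ("-9223372036854775808") when the
// follower has no entry at the requested index.
constexpr Term TERM_UNKNOWN = std::numeric_limits<Term>::min();

enum class AppendResult
{
  Ok,           // previous entry matches: append the batch
  MissingIndex, // we don't hold prev_idx: ask for an earlier batch instead
  TermMismatch  // we hold prev_idx, but in a different term: reject
};

AppendResult check_prev_entry(
  Term (*term_of)(Index), // assumed helper: local term at idx, or TERM_UNKNOWN
  Index prev_idx,
  Term prev_term)
{
  const Term ours = term_of(prev_idx);

  if (ours == TERM_UNKNOWN)
  {
    // e.g. Node 0 has nothing at idx 49 above, so it asks for 46+ instead.
    return AppendResult::MissingIndex;
  }

  if (ours != prev_term)
  {
    // e.g. "our log at 45 has the wrong previous term (ours: 8, theirs: 9)"
    return AppendResult::TermMismatch;
  }

  return AppendResult::Ok;
}
```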
--
As @eddyashton pointed out, it seems that this is a view consistency failure on the remaining nodes (Node 0 and Node 1). Investigating where the inconsistency comes from now...
We now have consistent repros, a good idea of what's broken, and we're narrowing in on a potential fix.
The problem is in how we handle rollback, vs how Raft handles overwriting of inconsistent logs:
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction
If you receive an append entries, you should apply it to your log, no questions asked (or blow up if you think something is actually broken). Our rollback behaviour tries to make this trivial - you've already rolled back to what they might send you. But actually we don't roll back far enough: specifically, we only roll back to our last committable index. This is safe and necessary when we first become a follower.
But it is valid for a primary to be deposed, and for its committable (but not committed) state to be lost by the service. When this old primary receives an append entries from the new primary, even though it voted against them, it should accept the new term and roll back to the append entries' index. The safety check here should be that we don't roll back committed state (in fact we should probably enforce this inside rollback, regardless of the calling reason), but it is safe to roll back committable state, which we currently don't do.
In summary: in become_follower we do rollback(last_committable_index()). This is necessary for the case where this is called from recv_request_vote_response. But we also need to call something like rollback(idx) in recv_append_entries, for when we've lost an election and need to remove our uncommitted suffix.
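A minimal sketch of the shape of that change, under simplified assumed names (ReplicaState, last_idx, commit_idx) rather than the actual CCF types:

```cpp
// Hypothetical sketch, not the actual CCF implementation. Types and member
// names are assumptions used for illustration only.
#include <algorithm>
#include <cstdint>

using Index = uint64_t;

struct AppendEntries
{
  Index prev_idx; // index immediately before the entries in this message
  Index idx;      // last index contained in this message
};

struct ReplicaState
{
  Index last_idx = 0;   // latest entry we hold (possibly uncommitted)
  Index commit_idx = 0; // latest entry we know is committed

  // Drop everything after idx; committed state is never dropped, so the
  // safety check lives inside rollback itself, whatever the caller.
  void rollback(Index idx)
  {
    last_idx = std::max(idx, commit_idx);
  }

  void recv_append_entries(const AppendEntries& r)
  {
    // ... term and sender checks elided ...

    // If we hold a suffix beyond the point the new primary is sending
    // from, we lost the election and that suffix must be removed.
    if (r.prev_idx < last_idx)
    {
      rollback(r.prev_idx);
    }

    // ... deserialise and append the entries from the message ...
  }
};
```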
@eddyashton that makes sense to me, and I think the fix you suggest in recv_append_entries is correct too, with a minor tweak: upon receiving an append entries for a higher term, I think we may want to roll back to min(our local opinion of the commit index, the new primary's commit index) rather than to the index of the received append entry, and only apply it if they're contiguous.
Actually I don't think we want to roll back blindly like this unless the terms differ, because we could lose persistence of the committable but not yet committed suffix. I think we want to keep all indices for which the term is compatible with the statement made by the latest append entry.
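A rough sketch of how those two suggestions combine, with invented names (sender_commit_idx, suffix_terms_conflict) standing in for whatever the real code would use:

```cpp
// Hypothetical sketch only: decide how far to roll back when an append
// entries arrives for a higher term.
#include <algorithm>
#include <cstdint>
#include <optional>

using Index = uint64_t;

// Returns the index to roll back to, or nullopt if nothing should be dropped.
std::optional<Index> rollback_target(
  Index local_commit_idx,
  Index sender_commit_idx,
  bool suffix_terms_conflict) // does our suffix disagree with the sender's terms?
{
  if (!suffix_terms_conflict)
  {
    // Our committable-but-uncommitted suffix is still consistent with the
    // statement made by the latest append entries: keep it.
    return std::nullopt;
  }

  // Otherwise be conservative: keep nothing beyond what is known to be
  // committed, either by us or by the new primary.
  return std::min(local_commit_idx, sender_commit_idx);
}
```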
Some quick notes from a long discussion about this:
Tracking when we need to roll back is non-trivial. It seems it's not possible, with the current state, to make a local determination when receiving an append entries of which entries are matching ones from terms you agree on and which should overwrite your suffix (i.e. trigger a rollback).
The index that we ACK when responding to append entries is also 'too high'. Either we need to track more cautiously what is safe to ACK (we currently optimistically report the last thing we know about), or we need to have done the rollback before we build this response.
The difficulty is in the general case. An append entries can say "Here are the entries from 3.15 to 4.50, and I'm in term 4". I may have correct values from term 3 up to @3.25, but also have a suffix up to @3.40 which is no longer valid (which the new primary did not have). We currently detect term changes within this batch (i.e. the fact that term 4 starts at @4.26) by replaying it and processing its signature entries. But we also skip replaying any entries we already have. We want the receiver to decide "roll back to 25, replay the batch from 26, and ack up to 50". This may need separate watermarks ("what's the absolute latest entry I have" vs "what's the latest thing I'm sure is correct" vs "what's the latest thing I've acked to the primary"), or we may need to be more pessimistic and roll back further.
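One possible shape for those separate watermarks, purely as a sketch (the names are invented, this is not the CCF data structure):

```cpp
// Hypothetical sketch of the three watermarks discussed above.
#include <cstdint>

using Index = uint64_t;

struct LogWatermarks
{
  Index last_received = 0; // absolute latest entry held, may be a dead suffix
  Index last_verified = 0; // latest entry we are sure the primary also has
  Index last_acked = 0;    // latest entry we have acknowledged to the primary

  // Invariant we'd want to maintain:
  //   last_acked <= last_verified <= last_received
};
```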
Resolved in #2192. Summarising the problem again for posterity, with a hopefully elucidating diagram:
a b c d
1 | | | | | | | | | | | X
2 | | | | | | | O
3 | | | | | O
4 | | O
5 | | O
1 is the primary. Primary 1 has last told everyone that commit was at a. O marks the last committable index each node has received (for simplicity, all of the | transactions are committable as well). All of these Os have been ACKd, and the primary has received all of these ACKs. Given the current ACKs, the primary knows commit has advanced to b, and can start advertising this to clients.
If primary 1 dies here, only node 2 or 3 can win the subsequent election, since the votes of 4 and 5 will get either of them over the majority (but neither 2 nor 3 would vote for 4 or 5).
If 3 becomes a candidate, it proposes that it will start its new term from b. 2 will vote against this (since 2's own log extends beyond b), but 3 will win regardless. This is correct - to win an election, 3 must have the index that 1 knew was committed, even if 3 doesn't know it was committed.
So the transactions at c and c-1 currently held by 2 are part of a dead suffix, along with everything between c and d - not enough nodes knew about them before the primary died, and they weren't saved by the elected node, so they are lost.

When node 2 receives a message from 3 saying "I will be giving you indices from b" (i.e. an append_entries with prev_idx == b), 2 must roll back to b to continue the persisted ledger.
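Reading the diagram's columns as concrete indices (an assumption for illustration: a = 3, b = 6, c = 8, d = 12) and reusing the hypothetical ReplicaState/AppendEntries sketch from earlier, node 2's situation looks like this:

```cpp
// Worked example with assumed concrete indices for the diagram above.
void diagram_example()
{
  ReplicaState node2;
  node2.last_idx = 8;   // c: node 2's last committable entry
  node2.commit_idx = 3; // a: the last commit point advertised by primary 1

  // The new primary, 3, won the election with a log ending at b = 6, so its
  // first append entries after the election carries prev_idx == 6.
  AppendEntries ae;
  ae.prev_idx = 6; // b
  ae.idx = 7;      // the first entry 3 produces in its new term

  // On receipt, node 2 must drop its dead suffix {7, 8} - rolling back to 6,
  // which is still above its commit index of 3 - before appending 3's
  // entries. This is exactly the rollback recv_append_entries was missing.
  node2.recv_append_entries(ae);
  // node2.last_idx is now 6.
}
```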
A followup from this is described in #2197; we should retain these dead forks on disk if we can, rather than aggressively truncating them.
rotation_test has been failing regularly in the CI, example runs here and here. Primarily, but not exclusively, in the NoSGX_Instrumented build, suggesting it is a timing issue. The failure appears to be an election deadlock; the surviving nodes get into a cycle of elections, and although they pass, we continue to start subsequent elections. We disabled the test here to avoid blocking PRs with unrelated CI failures.
I've finally obtained a repro on my local machine, so here's a representative log walkthrough. Just looking at the logs and pointing out anything odd for now, will clarify this with the code later.
Infra
We've just unsuspended 11, and waited until we think the network has healed (node 12 says commit has advanced to 17.139, committing the write we just did @17.138). Then we suspend node 12, wait for a new primary to be elected, and check that we can write to this node. This write (@21.140) never commits - global commit remains at 17.137 on node 10.

The output from the 3 involved nodes: node10.out.log node11.out.log node12.out.log
Node 12
Node 12 shows roughly what we expect - it becomes leader in term 17:
Then its last log line is at 09:56:20; it gets suspended shortly after. There are only a few suspicious lines:
Are we doing expensive snapshot work that is blocking Raft liveness?
11 is still appending from an old term. This instance seems harmless - 11 was the leader and they correct themselves quickly afterwards.
Node 11
Node 11 becomes a follower in 17, but with slightly odd surroundings:
I guess the request vote response is from a previous term, or perhaps even the election which made us leader? We should log the term of this response, and make it look less like an error if it's a late (but valid) vote for us in an election we just won. I don't know why we're voting against 12 immediately after becoming a follower. Again, we should log the term this vote is about.

After that 12 goes to sleep, so 11 starts a new election as candidate:
But node 10 votes against for some reason:
We repeat this for several terms. We briefly become follower in 21, but this times out just like 17 did:
This results in a very suspicious blob of late appends, which for some reason have reached us after we gave up on this term:
Node 10
Node 10 is the log I have the most questions about.
Why does it continually become follower but vote against?
Are we queuing comms to 11 because it is somehow still unknown to us?
General thoughts
I think I understand the initial instability - node 10 has the latest committed state (@17.138), which 12 wrote but 11 has not yet seen. But 11 keeps calling elections first (due to lucky timeouts), which it can never win. This is a known Raft pathology - you're going to have unsuccessful elections ~50% of the time while 12 is dead, but eventually 10 should win. That explains some election instability, but not why 10 is then deposed from term 21.