microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0
764 stars 205 forks source link

Consensus trace validation #5057

Closed achamayou closed 2 months ago

achamayou commented 1 year ago

Following a suggestion from @lemmy, we could set up trace validation for the CCF consensus implementation.

../src/consensus/aft/impl/state.h
../src/consensus/aft/raft.h
../src/ds/state_machine.h
../src/node/history.h
../src/kv/store.h
../src/node/node_state.h
lemmy commented 1 year ago

As @eddyashton pointed out, to work around the limitations of timestamps, we want to initially focus on the sequentially executing raft_scenarios.

The table shows the TLA+ actions and the updated (non-auxiliary) variables of ccfraft.tla. It indicates what state the trace locations have to be emitted.

Action state matchIndex' messages' log' state' configurations' commitIndex' commitsNotified' currentTerm' votedFor' votesGranted'
Timeout Follower, Candidate x (Candidate) x x x
ChangeConfiguration Leader x x
CheckQuorum Leader x (Follower)
ClientRequest Leader x (TypeEntry)
BecomeLeader Candidate x x x x
SignCommittableMessages Leader x (TypeSignature)
AdvanceCommitIndex Leader x x x
RequestVote Candidate snd
NotifyCommit RetiredLeader snd x
AppendEntries Leader snd
RcvDropIgnoredMessage any rcv
RcvUpdateTerm any rcv x x x
RcvAppendEntriesRequest any rcv x x x x
RcvRequestVoteRequest any rcv x
RcvRequestVoteResponse any rcv x
RcvAppendEntriesResponse any x rcv
RcvUpdateCommitIndex any rcv x x
lemmy commented 1 year ago

@eddyashton suggested that post-processing the existing logging with a Python script is a better approach because it can also be used in production. This raises the question of whether debug and trace log statements can be included in trace validation, as they may not be included in a production build.

Started hacking at https://github.com/lemmy/CCF/tree/mku-gh5057

lemmy commented 1 year ago

In the tests/raft_scenarios/4582.2 file, there is an issue where raft.h allows a pending node, named n1, to become a follower in response to receiving an AppendEntries message. However, this transition occurs even if n1's log is not up-to-date and is missing the new configuration that includes n1. Consequently, n1 responds to the message with an AppendEntriesResponseType::FAIL.

This behavior is not consistent with ccfraft.tla, which specifies that a pending node n1 can transition to a follower state iff log[n1] contains a configuration that includes n1 as a member.

Independently, ccfraft.tla does not model AppendEntries to contain multiple log entires. For trace validation, this might have to change.

lemmy commented 1 year ago

tests/raft_scenarios/replicate reveals that ccfraft.tla does not directly model the term update when a RaftMsgType::raft_request_vote message's term is greater than the node's current term:

https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1716-L1725 vs https://github.com/microsoft/CCF/blob/main/tla/ccfraft.tla#L679

Note that UpdateTerm does not remove a message from messages and that become_aware_of_new_term is the equivalent to the spec's UpdateTerm action. Thus, the spec models the above scenario by an UpdateTerm action followed by a RcvRequestVoteRequest action.

lemmy commented 1 year ago

tests/raft_scenarios/replicate reveals that ccfraft.tla does not directly model the term update when a RaftMsgType::raft_request_vote message's term is greater than the node's current term:

https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1716-L1725 vs https://github.com/microsoft/CCF/blob/main/tla/ccfraft.tla#L679

Note that UpdateTerm does not remove a message from messages and that become_aware_of_new_term is the equivalent to the spec's UpdateTerm action. Thus, the spec models the above scenario by an UpdateTerm action followed by a RcvRequestVoteRequest action.

Addressed with the help of https://github.com/tlaplus/tlaplus/tree/mku-cdot in https://github.com/lemmy/CCF/commit/4d253100a6b53c6729060747bc08bac9bd35ae7e

lemmy commented 1 year ago

The replicate scenario in the tests/raft_scenarios/replicate folder of the CCF repository exposes an issue in the ccfraft.tla file, specifically in line 457. This problem prevents the newly elected leader from sending an empty AppendEntries message after its election, as it is supposed to do. Currently, the only relevant enabled action in ccfraft.tla is ClientRequest, which advances the leader's log beyond nextIndex and enables AppendEntries. However, waiting for the first client request before announcing the election is not correct.

The wording "Send up to 1 entry, constrained by the end of the log." of ccfraft!AppendEntries indicates that it should be possible to send an empty AppendEntries message. Removing line 457 does not cause any violations in the simulation (SIMccraft) or exhaustive model checking (MCccfraft and MCccraftWithReconfig) but increases the specification's state space.

Depends on https://github.com/tlaplus/tlaplus/pull/805

lemmy commented 1 year ago

Here is one more discrepancy found by partially matching tests/raft_scenarios/4582.2. Action ccfraft!NoConflictAppendEntriesRequest does not permit a node n to become a follower if n is not a member of any configuration. Yet, the implementation permits node 1 to become a follower even though the node is not a member of any configuration (note "configurations":[] in corresponding log statement below obtained with https://github.com/lemmy/CCF/commit/fdf714070f9b5e97e0a65dc0b9a3edaa13f10ec1).

{"h_ts":"2023-04-03T18:31:13.652461Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"2165","msg":{"configurations":[],"event":{"component":"raft","function":"become_follower"},"leadership":"Follower","membership":"Active","node":"1","oldleadership":"none","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"1"}}}
eddyashton commented 1 year ago

Here is one more discrepancy found by partially matching tests/raft_scenarios/4582.2. Action ccfraft!NoConflictAppendEntriesRequest does not permit a node n to become a follower if n is not a member of any configuration. Yet, the implementation permits node 1 to become a follower even though the node is not a member of any configuration (note "configurations":[] in corresponding log statement below obtained with lemmy@fdf7140).

{"h_ts":"2023-04-03T18:31:13.652461Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"2165","msg":{"configurations":[],"event":{"component":"raft","function":"become_follower"},"leadership":"Follower","membership":"Active","node":"1","oldleadership":"none","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"1"}}}

I think the implementation behaviour is correct and necessary here. A node which joins without a snapshot starts with no knowledge of configurations, yet must be able to become Follower (or init as a Follower) and process incoming AppendEntries. Some other node considers them part of a configuration, and needs to bring them up to speed.

This directly contradicts a comment in ccfraft.tla:

A node's configuration is never "empty"

I think we need to update the spec to match the implementation. Either permit the change to Follower when the configuration is locally-unknown, or add an earlier transition to Follower (which is what happens in practice). I wonder what the model thinks happens here, because I guess its maybe possible that once you've become follower, you become candidate and elect yourself primary, with an empty KV?

I believe we don't currently model joining from a snapshot, where the comment is true (after parsing a valid snapshot, a node has at least some configuration, likely stale).

lemmy commented 1 year ago

Heads up: it seems that the trace validation has covered most of the "happy path", as yesterday's successful validation of tests/raft_scenarios/election right out of the box!

lemmy commented 1 year ago

election_while_reconfiguration (and reconfiguration) also detects the spec<>code divergence found in https://github.com/microsoft/CCF/issues/5057#issuecomment-1494791990.

Startup:

{"h_ts":"2023-04-06T18:01:30.548291Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"522","msg":{"configurations":[{"idx":0,"nodes":{"0":{"address":":"}},"rid":0}],"event":{"component":"raft","function":"add_configuration"},"leadership":"none","membership":"Active","node":"0","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":0,"last_idx":0,"my_node_id":"0"}}}
{"h_ts":"2023-04-06T18:01:30.548655Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"2031","msg":{"configurations":[{"idx":0,"nodes":{"0":{"address":":"}},"rid":0}],"event":{"component":"raft","function":"become_candidate"},"leadership":"Candidate","membership":"Active","node":"0","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"0"}}}
{"h_ts":"2023-04-06T18:01:30.548842Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"2099","msg":{"configurations":[{"idx":0,"nodes":{"0":{"address":":"}},"rid":0}],"event":{"component":"raft","function":"become_leader"},"leadership":"Leader","membership":"Active","node":"0","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"0"}}}

Leader sends an empty AppendEntries followed by a non-empty AppendEntries message:

{"h_ts":"2023-04-06T18:01:30.549822Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"522","msg":{"configurations":[{"idx":0,"nodes":{"0":{"address":":"}},"rid":0},{"idx":1,"nodes":{"0":{"address":":"},"1":{"address":":"}},"rid":1}],"event":{"component":"raft","function":"add_configuration"},"leadership":"Leader","membership":"Active","node":"0","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"0"}}}
{"h_ts":"2023-04-06T18:01:30.549998Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"1255","msg":{"event":{"component":"raft","function":"send_authenticated"},"leadership":"Leader","membership":"Active","mentries":0,"node":"0","paket":{"contains_new_view":false,"leader_commit_idx":0,"msg":0,"prev_term":0,"term":1,"term_of_idx":0},"state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"0"},"to":"1","type":1}}
{"h_ts":"2023-04-06T18:01:30.550149Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/test/driver.h","number":"62","msg":{"data":"eyJkYXRhIjoiZXlJd0lqcDdJbUZrWkhKbGMzTWlPaUk2SW4wc0lqRWlPbnNpWVdSa2NtVnpjeUk2SWpvaWZYMD0iLCJ0eXBlIjoicmVjb25maWd1cmF0aW9uIn0=","event":{"component":"ledger","function":"append"},"index":1,"node":"0","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":1,"my_node_id":"0"},"term":1}}
{"h_ts":"2023-04-06T18:01:30.550384Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"1255","msg":{"event":{"component":"raft","function":"send_authenticated"},"leadership":"Leader","membership":"Active","mentries":1,"node":"0","paket":{"contains_new_view":false,"leader_commit_idx":0,"msg":0,"prev_term":0,"term":1,"term_of_idx":1},"state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":1,"my_node_id":"0"},"to":"1","type":1}}

Receipt of empty AppendEntries and transition to follower state:

{"h_ts":"2023-04-06T18:01:30.550543Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"923","msg":{"event":{"component":"raft","function":"recv_append_entries"},"from":"0","leadership":"none","membership":"Active","node":"1","paket":{"contains_new_view":false,"leader_commit_idx":0,"msg":0,"prev_term":0,"term":1,"term_of_idx":0},"state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":0,"last_idx":0,"my_node_id":"1"},"type":0}}
{"h_ts":"2023-04-06T18:01:30.550700Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/test/driver.h","number":"73","msg":{"event":{"component":"ledger","function":"truncate"},"index":0,"node":"1","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"1"}}}
{"h_ts":"2023-04-06T18:01:30.550852Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"2166","msg":{"configurations":[],"event":{"component":"raft","function":"become_follower"},"leadership":"Follower","membership":"Active","node":"1","oldleadership":"none","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"1"}}}
{"h_ts":"2023-04-06T18:01:30.551132Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"1647","msg":{"event":{"component":"raft","function":"send_append_entries_response"},"leadership":"Follower","membership":"Active","node":"1","paket":{"last_log_idx":0,"msg":1,"success":0,"term":1},"state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"1"},"to":"0","type":1}}

Receipt of second, non-empty AppendEntries:

{"h_ts":"2023-04-06T18:01:30.551263Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"923","msg":{"event":{"component":"raft","function":"recv_append_entries"},"from":"0","leadership":"Follower","membership":"Active","node":"1","paket":{"contains_new_view":false,"leader_commit_idx":0,"msg":0,"prev_term":0,"term":1,"term_of_idx":1},"state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":0,"my_node_id":"1"},"type":0}}
{"h_ts":"2023-04-06T18:01:30.551568Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"522","msg":{"configurations":[{"idx":1,"nodes":{"0":{"address":":"},"1":{"address":":"}},"rid":1}],"event":{"component":"raft","function":"add_configuration"},"leadership":"Follower","membership":"Active","node":"1","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":1,"my_node_id":"1"}}}
{"h_ts":"2023-04-06T18:01:30.551711Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/test/driver.h","number":"62","msg":{"data":"eyJkYXRhIjoiZXlJd0lqcDdJbUZrWkhKbGMzTWlPaUk2SW4wc0lqRWlPbnNpWVdSa2NtVnpjeUk2SWpvaWZYMD0iLCJ0eXBlIjoicmVjb25maWd1cmF0aW9uIn0=","event":{"component":"ledger","function":"append"},"index":1,"node":"1","state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":1,"my_node_id":"1"},"term":1}}
{"h_ts":"2023-04-06T18:01:30.551962Z","thread_id":"100","level":"info","tag":"tla","file":"../src/consensus/aft/raft.h","number":"1647","msg":{"event":{"component":"raft","function":"send_append_entries_response"},"leadership":"Follower","membership":"Active","node":"1","paket":{"last_log_idx":1,"msg":1,"success":0,"term":1},"state":{"cft_watermark_idx":0,"commit_idx":0,"current_view":1,"last_idx":1,"my_node_id":"1"},"to":"0","type":1}}
...

The problem is that "become_follower" log event is triggered by the first, empty message, whereas the "add_configuration" event by the follower is prompted by the subsequent message. Removing the conjunct (https://github.com/microsoft/CCF/blob/main/tla/ccfraft.tla#L796) that prevents a pending node from becoming follower despite its configurations does not fully address the problem, because ccfraft!InitReconfigurationVars does not accept initial states with an empty configuration. Related:

This directly contradicts a comment in ccfraft.tla:

A node's configuration is never "empty"

I think we need to update the spec to match the implementation. Either permit the change to Follower when the configuration is locally-unknown, or add an earlier transition to Follower (which is what happens in practice). I wonder what the model thinks happens here, because I guess its maybe possible that once you've become follower, you become candidate and elect yourself primary, with an empty KV?

I believe we don't currently model joining from a snapshot, where the comment is true (after parsing a valid snapshot, a node has at least some configuration, likely stale).

~The requirement to prevent a node from becoming a follower without a configuration was introduced in https://github.com/microsoft/CCF/issues/4806 to keep it from becoming a primary/leader.~

lemmy commented 1 year ago

Fully matched tests/raft_scenarios/check_quorum, tests/raft_scenarios/reconnect, and tests/raft_scenarios/reconnect_node.

achamayou commented 1 year ago

The requirement to prevent a node from becoming a follower without a configuration was introduced in https://github.com/microsoft/CCF/issues/4806 to keep it from becoming a primary/leader.

I don't think I understand the logical connection between preventing a node from becoming a follower without a config to becoming a leader without config. The latter is impossible in a real system, and should be impossible with the driver too. The former is necessary for every new joiner node (ie. every node except for the start node, which has a hardcoded config of itself).

A real joiner will either:

  1. Have a stale config from a snapshot.
  2. Have no config at all.

In either case, upon seeing an AppendEntries, the new joiner needs to transition to Follower and cannot expect to know the current primary.

lemmy commented 1 year ago

The requirement to prevent a node from becoming a follower without a configuration was introduced in #4806 to keep it from becoming a primary/leader.

I don't think I understand the logical connection between preventing a node from becoming a follower without a config to becoming a leader without config. The latter is impossible in a real system, and should be impossible with the driver too. The former is necessary for every new joiner node (ie. every node except for the start node, which has a hardcoded config of itself).

A real joiner will either:

  1. Have a stale config from a snapshot.
  2. Have no config at all.

In either case, upon seeing an AppendEntries, the new joiner needs to transition to Follower and cannot expect to know the current primary.

Thanks for the correction. I updated my previous comment.

lemmy commented 1 year ago

Fully matched tests/raft_scenarios/bad_network.

lemmy commented 1 year ago

With a strengthened matching of AppendEntriesRequest and AppendEntriesResponse messages (below), tests/raft_scenario/bad_network no longer validates. The problem is that the implementation's prev_idx increments after messages are dropped and resend (line 30-33), while ccfraft.tla does not seem to even model sent_idx.

IsAppendEntriesRequest(msg, dst, src, logline) ==
    /\ msg.mtype = AppendEntriesRequest
    /\ msg.mtype = RaftMsgType[logline.msg.paket.msg + 1]
    /\ msg.mdest   = dst
    /\ msg.msource = src
    /\ msg.mterm = logline.msg.paket.term
    /\ msg.mcommitIndex = logline.msg.paket.leader_commit_idx
    /\ msg.mprevLogTerm = logline.msg.paket.prev_term
    /\ Len(msg.mentries) = logline.msg.paket.idx - logline.msg.paket.prev_idx
    /\ msg.mprevLogIndex = logline.msg.paket.prev_idx   \* < NEW

IsAppendEntriesResponse(msg, dst, src, logline) ==
    /\ msg.mtype = AppendEntriesResponse
    /\ msg.mtype = RaftMsgType[logline.msg.paket.msg + 1]
    /\ msg.mdest   = dst
    /\ msg.msource = src
    /\ msg.mterm = logline.msg.paket.term
    \* raft_types.h enum AppendEntriesResponseType
    /\ msg.msuccess = (logline.msg.paket.success = 0)   \* < NEW
    /\ msg.mmatchIndex = logline.msg.paket.last_log_idx   \* < NEW

A partial fix to the problem provides the following patch:

diff --git a/tla/ccfraft.tla b/tla/ccfraft.tla
index 1f8a1abd9..c7e0823fa 100644
--- a/tla/ccfraft.tla
+++ b/tla/ccfraft.tla
@@ -508,7 +508,11 @@ AppendEntries(i, j) ==
        /\ \E b \in AppendEntriesBatchsize(i, j):
             /\ InMessagesLimit(i, j, b)
             /\ Send(msg(b))
-    /\ UNCHANGED <<reconfigurationVars, commitsNotified, serverVars, candidateVars, leaderVars, logVars>>
+            \* Record the most recent index we have sent to this node.
+            \* (see https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L968)
+            \* Inc by 1 to account for the fact that we send the next index.
+            /\ nextIndex' = [nextIndex EXCEPT ![i][j] = max(@, lastEntry(b) + 1)]
+    /\ UNCHANGED <<reconfigurationVars, commitsNotified, serverVars, candidateVars, matchIndex, logVars>>

 \* Candidate i transitions to leader.
 BecomeLeader(i) ==
@@ -740,7 +744,7 @@ RejectAppendEntriesRequest(i, j, m, logOk) ==
     /\ Reply([mtype           |-> AppendEntriesResponse,
               mterm           |-> currentTerm[i],
               msuccess        |-> FALSE,
-              mmatchIndex     |-> 0,
+              mmatchIndex     |-> m.mcommitIndex, \* See response_idx of send_append_entries_response in raft.h (https://github.com/microsoft/CCF/blob/main/src/consensus/aft/raft.h#L1321)
               msource         |-> i,
               mdest           |-> j],
               m)
@@ -861,7 +865,7 @@ HandleAppendEntriesResponse(i, j, m) ==
           /\ matchIndex' = [matchIndex EXCEPT ![i][j] = m.mmatchIndex]
        \/ /\ \lnot m.msuccess \* not successful
           /\ nextIndex' = [nextIndex EXCEPT ![i][j] =
-                               Max({nextIndex[i][j] - 1, 1})]
+                               Max({m.mmatchIndex, matchIndex[i][j]}) + 1]
           /\ UNCHANGED matchIndex
     /\ Discard(m)
     /\ UNCHANGED <<reconfigurationVars, messagesSent, commitsNotified, serverVars, candidateVars, logVars>>
lemmy commented 1 year ago

Related PRs:

https://github.com/microsoft/CCF/issues/5676 https://github.com/microsoft/CCF/pull/5250 https://github.com/microsoft/CCF/pull/5250 https://github.com/microsoft/CCF/pull/5238 https://github.com/microsoft/CCF/pull/5233 https://github.com/microsoft/CCF/pull/5222 https://github.com/microsoft/CCF/pull/5201 https://github.com/microsoft/CCF/pull/5200 https://github.com/microsoft/CCF/pull/5193 https://github.com/microsoft/CCF/pull/5187 https://github.com/microsoft/CCF/pull/5154 https://github.com/microsoft/CCF/pull/5150 https://github.com/microsoft/CCF/issues/4473

(bookkeeping)

lemmy commented 9 months ago

Further progress blocked by https://github.com/microsoft/CCF/issues/5220

achamayou commented 4 months ago

5220 is complete, next is merging #5919 (support for full reconfiguration in the spec), and fixing the liveness issue it has raised in the implementation (investigation in progress in #5951, more detail to be added in a GH issue once confirmed).

I suspect the next steps after that will consist of:

  1. Improving the scenario coverage, still using the driver.
  2. Improving the performance of the model checking to be able to run it with more interesting constraints

It's unlikely that we will look at validation of end to end scenarios before we have completed these two items.

lemmy commented 4 months ago

2. Improving the performance of the model checking to be able to run it with more interesting constraints

In-progress and pending work on TLC is tracked at https://github.com/tlaplus/tlaplus/pull/867

achamayou commented 4 months ago

Now that #5919 is in, I was able to move one of the last two existing scenarios left that needed retirement in #5967.

5925 is the only pre-existing scenario that isn't trace validated yet (because of the known absence of is_new_follower in the spec). The new candidate viability test in #5935 is pending the TLC change described above.

Post-liveness fix, or perhaps as part of it, we should expand reconfiguration scenario coverage. I think that will bring us to a point where we have fairly decent trace validation, and we will want to look at model checking more thoroughly again.

achamayou commented 3 months ago

All of the above except #5935 is now complete, validation work has moved to the following two issues:

  1. Liveness during node retirement
  1. Safety during post-election rollback (see #5927)

This is not currently thought to require spec changes, because the approach adopted is to remove eager rollback and leave the NACK divergence with Raft unchanged.

achamayou commented 2 months ago

I've removed end to end trace validation from this, let's plan and track it separately.

We have gotten to the point where all the scenarios pass in CI, including reconfiguration. We've gone from 15 scenarios (~1000 lines) when we opened this ticket (using a substantially different and simplified reconfiguration mechanism) to 28 now (~2300 lines) using realistic reconfiguration. With #6016 and #5973 we have also got to a point where we have no known issues in the implementation (*).

While it would be interesting to get formal coverage information from the spec, and there will always be opportunities to add new scenarios, I believe the current coverage is good enough to call this complete. I think we should now look at optimising the spec to improve model checking coverage.

(*) Which isn't to say we won't find more, or that there aren't optimisation opportunities.