Closed heidihoward closed 3 months ago
tla-rollback@81745 aka 20240212.42 vs main ewma over 20 builds from 81376 to 81732
This change makes TLC still accept rollback_safety.ndjson
while showing proper action names.
diff --git a/tla/consensus/Traceccfraft.tla b/tla/consensus/Traceccfraft.tla
index 89ff917e8..f6d7146f6 100644
--- a/tla/consensus/Traceccfraft.tla
+++ b/tla/consensus/Traceccfraft.tla
@@ -373,9 +373,11 @@ TraceNext ==
\/ IsExecuteAppendEntries
\/ IsRcvProposeVoteRequest
+ \/ /\ l = 191
+ /\ DropMessages \cdot IsRcvRequestVoteRequest
DropAndNext ==
- \/ IF ENABLED TraceNext THEN TraceNext ELSE DropMessages \cdot TraceNext
+ \/ TraceNext
\/ IsDropPendingTo
TraceSpec ==
Hypothesis why TV doesn't fail:
1) Validation in Traceccfraft is generally too weak, i.e., we have to come up with more things to assert about the trace (see @heidihoward _fake.ndjson above)
2) The json trace/raft scenario ends prematurely for TV to catch failures (some log lines such as execute_append_entries_sync
are noops in Traceccfraft).
3) QuorumLogInv
is a conjunct of TraceMatchesConstraints
, whereas it's commented in e.g. SIMccfraft.cfg
.
4) Occurrence of \cdot
in ccfraft
, combined with the incomplete implementation in TLC, may cause problems.
4a) The debugger's unsatisfied-next breakpoint and https://github.com/tlaplus/tlaplus/commit/c2713def38aa1d3255a6c2fe2aa037253e00e361 could both be incompatible with \cdot, causing states with a roll-backed log not to be in the model.
Emitting two more log lines by appending dispatch_one,2
to the scenario, and commenting/removing QuorumLogInv
adds states to the model where node #2's log has been rolled back:
Questions:
1) What non-determinism in ccfraft
allows node 2 not to roll back its log at l=214
?
=> AppendEntriesAlreadyDone
(@eddyashton: You mentioned a known discrepancy during today's standup?!)
2) What assertions should be added to Traceccfraft
to exclude states with the unchanged log?
=> state.last_idx
, i.e., the length of a node's log, is nowhere asserted.
main
with
Hypothesis why TV doesn't fail:
- Validation in Traceccfraft is generally too weak, i.e., we have to come up with more things to assert about the trace (see @heidihoward _fake.ndjson above)
- The json trace/raft scenario ends prematurely for TV to catch failures (some log lines such as
execute_append_entries_sync
are noops in Traceccfraft).QuorumLogInv
is a conjunct ofTraceMatchesConstraints
, whereas it's commented in e.g.SIMccfraft.cfg
.- Occurrence of
\cdot
inccfraft
, combined with the incomplete implementation in TLC, may cause problems. 4a) The debugger's unsatisfied-next breakpoint and tlaplus/tlaplus@c2713de could both be incompatible with \cdot, causing states with a roll-backed log not to be in the model.
main
). I suggest we continue strengthening TV by working on the various TODOs in Traceccfraft.tla
.
- turned out to be a red herring, and 2. and 3. are optional; 1. (Strengthen assertions in trace validation #6010) is what makes TV find the violation at state 212 (even in
main
). I suggest we continue strengthening TV by working on the various TODOs inTraceccfraft.tla
.
I take that back for now.
diff --git a/tla/consensus/Traceccfraft.tla b/tla/consensus/Traceccfraft.tla
index 1b69b4cb7..d9bbfece9 100644
--- a/tla/consensus/Traceccfraft.tla
+++ b/tla/consensus/Traceccfraft.tla
@@ -219,7 +219,8 @@ IsRcvAppendEntriesRequest ==
/\ commitIndex[logline.msg.state.node_id] = logline.msg.state.commit_idx
\* TODO /\ leadershipState[logline.msg.state.node_id] = ToLeadershipState[logline.msg.state.leadership_state]
/\ membershipState[logline.msg.state.node_id] \in ToMembershipState[logline.msg.state.membership_state]
- \*TODO/\ Len(log'[logline.msg.state.node_id]) = logline.msg.state.last_idx
+ /\ \/ Len(log[logline.msg.state.node_id]) = logline.msg.state.last_idx
+ \/ Len(log'[logline.msg.state.node_id]) = TraceLog[l-1].msg.state.last_idx
IsSendAppendEntriesResponse ==
\* Skip saer because ccfraft!HandleAppendEntriesRequest atomcially handles the request and sends the response.
@heidihoward Any reason to keep the tla-rollback
branch around?
Nope, fine to delete
@lemmy so what was the root cause in the end?
Not enough/too weak assertions.
Using the CI to test trace validation and spec.
This PR is not for merging, just for experimenting with