Closed by achamayou 4 months ago
generate_scenario_from_trace@80033 aka 20240109.7 vs main ewma over 20 builds from 79781 to 80028
Some discoveries:
- core dump when, e.g., asserting its state

@lemmy apologies for editing the comment, but adding the collapsible blocks makes the PR much more readable.
The first issue is that the mapping ignores `ChangeConfiguration`, because the (unreasonably optimistic) expectation was that it would translate `ChangeConfigurationInt` instead, given that:

```
\E newConfiguration \in SUBSET(Servers \ removedFromConfiguration) :
    ChangeConfigurationInt(i, newConfiguration)
```

But `ChangeConfigurationInt` doesn't seem to make it to the trace, I guess because it's a function called by the action and not the action itself. That's annoying, because logging `ChangeConfiguration` alone and its parameter `i` is obviously insufficient. I guess we need to diff the pre and post states to find the new configuration in this case.
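To make the diffing idea concrete, here is a minimal Python sketch. The helper name and the trace's state layout are assumptions for illustration, not the actual `trace2scen.py` code: it recovers the proposed configuration by comparing a node's configurations map in the pre- and post-states of a step.

```python
# Hypothetical sketch: recover the newConfiguration argument of
# ChangeConfigurationInt by diffing pre- and post-state, since only the
# enclosing ChangeConfiguration action (and its parameter i) is logged.
# The "configurations" layout (node -> {config index -> member list}) is
# an assumption about the dumped trace, not the real format.

def new_configuration_from_diff(pre_state, post_state, node):
    """Return the members of the configuration added on `node`, or None."""
    pre = pre_state["configurations"][node]
    post = post_state["configurations"][node]
    # Any configuration index present after the step but not before it
    # must be the one proposed by ChangeConfigurationInt.
    added = [cfg for idx, cfg in post.items() if idx not in pre]
    return sorted(added[0]) if added else None
```

The same comparison generalises to other actions whose arguments are not serialized: any variable that changes between pre- and post-state constrains the hidden parameter.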
The second issue is that we should use `trust_node` rather than `replicate_new_configuration`, since that does both the creation and the trusting; that's my mistake.

The third issue is that this performs a complete replacement, not just a node addition, and we don't have a good way to do that in the driver now.
Found another mismatch, before even getting to the reconfig:
The generated assert for the end of state 4 is for commit index 2 on n1:
```
start_node,n1
emit_signature,2
# MCClientRequest MCccfraft:65
replicate,2,42
assert_commit_idx,n1,2
# MCSignCommittableMessages MCccfraft:73
emit_signature,2
assert_commit_idx,n1,2
```
But it is 4 in the implementation, since 4 is committable and the whole config ({n1}) is aware. Which, now that I think about it, is exactly what #5798 uncovered.
@lemmy I've fixed reconfiguration by tweaking `replicate_new_configuration` to create nodes when they are missing rather than throw, and added logic that fetches the last config for the node from the post state. That seems to work ok on a scenario similar to the one you pasted, but the issue with commit asserts remains, since the spec just behaves differently.
1) Simulate `SIMccfraft`
2) Convert with `trace2scen.py`
3) Feed into ~~`raft_driver`~~ `raft_scenarios_runner.py` (the former no longer produces ndjsons that can be fed into `Traceccfraft`, compare https://github.com/microsoft/CCF/commit/a6384ef20f7ff46aa96254c9d208d00ceb830f12#r135809467 and https://github.com/microsoft/CCF/commit/7230570efe3986ce4d93e945b062d1459aa47a37#r135808748)
4) Dump `raft_trace` to ndjson
5) Feed into `Traceccfraft`
```sh
root@codespaces-a47fa6:/workspaces/CCF/tla/consensus# TS=$(date +%s) && \
../tlc.sh -note SIMccfraft.tla -simulate num=1 -dumptrace json SIMccfraft-$TS.json > SIMccfraft-$TS.out ; \
python3 ../trace2scen.py SIMccfraft-$TS.json > SIMccfraft-$TS.rt ; \
python3 ../raft_scenarios_runner.py ../../build/raft_driver SIMccfraft-$TS.rt | tee >(grep '"tag":"raft_trace"' > SIMccraft-$TS.ndjson) && \
JSON=SIMccraft-$TS.ndjson ../tlc.sh -note Traceccfraft.tla
```
```diff
diff --git a/tla/consensus/SIMccfraft.cfg b/tla/consensus/SIMccfraft.cfg
index b920aaba8..e4375a917 100644
--- a/tla/consensus/SIMccfraft.cfg
+++ b/tla/consensus/SIMccfraft.cfg
@@ -90,3 +90,4 @@ INVARIANTS
 \* DebugInvSuccessfulCommitAfterReconfig
 \* DebugInvAllMessagesProcessable
 \* DebugInvRetirementReachable
+ DebugUpToDepth
diff --git a/tla/consensus/SIMccfraft.tla b/tla/consensus/SIMccfraft.tla
index bfacc4e7f..e14af7924 100644
--- a/tla/consensus/SIMccfraft.tla
+++ b/tla/consensus/SIMccfraft.tla
@@ -39,7 +39,7 @@ SIMNext ==
     [] OTHER -> IF ENABLED Forward THEN Forward ELSE Next
 SIMSpec ==
-    Init /\ [][SIMNext]_vars
+    Init /\ [][Next]_vars \* Next and not SIMNext to get proper action names & params.
 \* The state constraint StopAfter stops TLC after the alloted
 \* time budget is up, unless TLC encounteres an error first.
@@ -48,6 +48,8 @@ StopAfter ==
     (* The smoke test has a time budget of 20 minutes. *)
     IN TLCSet("exit", TLCGet("duration") > timeout)
+DebugUpToDepth ==
+    TLCGet("level") < TLCGet("config").depth
 =============================================================================
 ------------------------------- MODULE SIMPostCondition -------------------------------
diff --git a/tla/trace2scen.py b/tla/trace2scen.py
index d4e9ddb66..2811a5c09 100644
--- a/tla/trace2scen.py
+++ b/tla/trace2scen.py
@@ -32,6 +32,12 @@ MAP = {
     "RejectAppendEntriesRequest": lambda _, __, ___: ["# Noop"],
     "ReturnToFollowerState": lambda _, __, ___: ["# Noop"],
     "AppendEntriesAlreadyDone": lambda _, __, ___: ["# Noop"],
+    "RcvUpdateTerm": lambda _, __, ___: ["# Noop"],
+    "RcvAppendEntriesRequest": lambda _, __, ___: ["# Noop"],
+    "RcvAppendEntriesResponse": lambda _, __, ___: ["# Noop"],
+    "RcvDropIgnoredMessage": lambda _, __, ___: ["# Noop"],
+    "RcvRequestVoteRequest": lambda _, __, ___: ["# Noop"],
+    "RcvRequestVoteResponse": lambda _, __, ___: ["# Noop"],
 }
 def post_commit(post):
```
> The first issue is that the mapping ignores `ChangeConfiguration`, because the (unreasonably optimistic) expectation was that it would translate `ChangeConfigurationInt` instead, given that: `\E newConfiguration \in SUBSET(Servers \ removedFromConfiguration) : ChangeConfigurationInt(i, newConfiguration)`
>
> But `ChangeConfigurationInt` doesn't seem to make it to the trace, I guess because it's a function called by the action and not the action itself. That's annoying, because logging `ChangeConfiguration` alone and its parameter `i` is obviously insufficient. I guess we need to diff the pre and post states to find the new configuration in this case.
This (syntactic) change (now part of https://github.com/microsoft/CCF/pull/5880) causes `ChangeConfigurationInt` with parameters `i` and `newConfiguration` to be included in the trace.
```diff
diff --git a/tla/consensus/ccfraft.tla b/tla/consensus/ccfraft.tla
index 7d5fae84c..af1eefd69 100644
--- a/tla/consensus/ccfraft.tla
+++ b/tla/consensus/ccfraft.tla
@@ -635,6 +635,8 @@ SignCommittableMessages(i) ==
 ChangeConfigurationInt(i, newConfiguration) ==
     \* Only leader can propose changes
     /\ state[i] = Leader
+    \* Configuration is not a previously removed configuration.
+    /\ newConfiguration \notin removedFromConfiguration
     \* Configuration is non empty
     /\ newConfiguration /= {}
     \* Configuration is a proper subset of the Servers
@@ -657,7 +659,7 @@ ChangeConfigurationInt(i, newConfiguration) ==
                leaderVars, commitIndex, committableIndices>>
 ChangeConfiguration(i) ==
-    \E newConfiguration \in SUBSET(Servers \ removedFromConfiguration) :
+    \E newConfiguration \in SUBSET(Servers) :
         ChangeConfigurationInt(i, newConfiguration)
 \* Leader i advances its commitIndex to the next possible Index.
```
(Question: How do we inspect the state of `raft_driver` beyond what is serialized in the json output?)
=> Workaround: One can simply run the generated raft scenario up to the point where the impl is in the desired state.

`Traceccfraft.tla` provides the global system perspective, whereas the `raft.h` debugger provides a node-local perspective. (Question: How do we debug multiple nodes with `raft_driver`?)

It appears as if a main benefit of co-debugging/lockstep debugging is locked away by the fact that the driver cannot reverse the impl. Otherwise, one could interactively explore the state space at every state of a behavior.
1) Emit the next raft scenario command
   a) When generating a complete scenario from a TLA+ trace
   b) One by one when creating a TLA+ trace with e.g. the TLA+ debugger
2) Validate the resulting json output
   a) With `Traceccfraft.tla` after the json log has been serialized
   b) Interactively, when moving to the next state with e.g. the TLA+ debugger

"Batch" is trivial: Write the trace and print the raft scenario commands with an Alias or a (Python) script. Verify the json output with the existing `Traceccfraft.tla`. Interactive is more challenging: Emit 1b) with `TLCDefer` as part of the next-state relation, and verify 2b) in an action-constraint.
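The "batch" path can be sketched roughly as follows. The noop action names mirror the `MAP` additions in the `trace2scen.py` diff above, but the trace layout and the `emit_signature` argument handling are illustrative assumptions, not the real script:

```python
# Simplified stand-in for the batch conversion: walk a dumped TLA+ trace
# and emit one raft scenario line per action, turning receive-side
# actions into noops because the driver performs delivery implicitly.
# The step shape ({"action": ..., "args": {...}}) is an assumption.

NOOP_ACTIONS = {
    "RcvUpdateTerm",
    "RcvAppendEntriesRequest",
    "RcvAppendEntriesResponse",
    "RcvDropIgnoredMessage",
    "RcvRequestVoteRequest",
    "RcvRequestVoteResponse",
}

def convert(trace):
    """Turn a list of trace steps into raft scenario lines."""
    lines = []
    for step in trace:
        action = step["action"]
        if action in NOOP_ACTIONS:
            lines.append("# Noop")
        elif action == "SignCommittableMessages":
            # emit_signature takes the current term, as in the
            # generated scenario shown earlier.
            lines.append("emit_signature,%d" % step["args"]["term"])
        else:
            lines.append("# Unmapped action: " + action)
    return lines
```

An Alias that pretty-prints each step, or a post-processing script like this over the `-dumptrace json` output, both fit the "batch" description.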
Let `DS` be the debug (ccfraft) spec and `TS` the `Traceccfraft` spec. A HyperSpec defines `Next == DS!Next /\ TS!Next`, where `DS!Next` emits the next raft scenario command, and `TS!Next` reads the json output of the raft driver.

@lemmy thank you for making the change in #5880, I've updated the conversion, and also added the missing actions you spotted when running the sim. It's a little annoying that the default SIM traces don't work, and that changing the `Next` is necessary, but the improved coverage is desirable.
> (Question: How do we debug multiple nodes with `raft_driver`?)

`raft_driver` is single-threaded; nodes are separate objects, but all execution is serialised.

> It appears as if a main benefit of co-debugging/lockstep debugging is locked away by the fact that the driver cannot reverse the impl.

Agreed, though as discussed before the new year, making the driver able to reverse the state of each node is probably achievable: nearly all of it already supports trivial json ser/deser. We would also need to put the messages back on the queues, and restore time elapsed. I don't know that we want to go much further before we complete #5822 at least.
For now, the intermediate result this gives us is: "A validation trace can (mostly) be fed back to the driver for investigation".
Can we get traces from the existing (matching) scenarios, and see what scenarios this generates from them?
> Can we get traces from the existing (matching) scenarios, and see what scenarios this generates from them?

We have only two scenarios compatible with the new bootstrapping logic. @achamayou does it make sense to port all deprecated scenarios?
> > Can we get traces from the existing (matching) scenarios, and see what scenarios this generates from them?
>
> We have only two scenarios compatible with the new bootstrapping logic. @achamayou does it make sense to port all deprecated scenarios?

I've been trying to do that (#5839), but have not found enough time so far.

The traces won't match, because all information related to message dispatch and periodic/time elapsed is lost right now. I don't think we can get them to match without substantially reworking either the spec or the driver, as discussed.
Very basic script, using functionality introduced by https://github.com/tlaplus/tlaplus/issues/853, to try to translate a trace to a raft scenario, for the purpose of executing counter-examples on the implementation.
Example run, using the following spec break:
It's not straightforward to do this correctly, because the spec is set up differently from the driver: individual message delivery and consumption are contained in steps, whereas the driver has a separate dispatch action. This difference in granularity means that we probably can't reproduce every scenario precisely, but we can still get quite close most of the time.
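As an illustration of that granularity mismatch, here is a hedged Python sketch. The command name `dispatch_all` is an assumption about the driver's scenario vocabulary (only `start_node`, `emit_signature`, `replicate`, `assert_commit_idx`, `trust_node`, and `replicate_new_configuration` appear above), so treat it as a placeholder:

```python
# Sketch of expanding one spec-level receive action into driver steps:
# the spec delivers and consumes a message atomically, while the driver
# needs an explicit dispatch before the receiving node processes it.
# "dispatch_all" is an assumed command name, not confirmed here.

def expand_receive_step(action, dst):
    """Expand a single spec receive action into driver scenario lines."""
    if action.startswith("Rcv"):
        return ["dispatch_all", f"# {dst} consumes the dispatched message"]
    return [f"# Non-receive action: {action}"]
```

Because one dispatch can deliver several pending messages at once, the expansion is not one-to-one, which is exactly why precise reproduction is hard.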