microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0
784 stars 212 forks source link

Invariants violated in TLA+ reduced state specification #3828

Closed heidihoward closed 2 years ago

heidihoward commented 2 years ago

Describe the bug TLC found a violation of the LeaderCompletenessInv invariant when checking the current TLA+ specifcation in https://github.com/microsoft/CCF/blob/main/tla/reduced_raft_spec/ccfraft.tla.

To Reproduce Use TLC to check the current spec in https://github.com/microsoft/CCF/blob/main/tla/reduced_raft_spec/MCraft.tla using a term limit of 3 and request limit of 2.

Expected behavior TLC should have completed checking without finding a safety violation.

Environment information Host: Azure standard_HB120rs_v3 OS: Ubuntu 20.04 LTS TLC: TLC2 Version 2.17 of 02 February 2022

Additional context The final counter example provided by TLC is copied below:

Progress(37) at 2022-05-10 15:45:51: 1,356,436,766 states generated (13,140,222 s/min), 198,360,645 distinct states found (1,857,244 ds/min), 46,662,968 states left on queue.
Error: Invariant LeaderCompletenessInv is violated.
....
State 35: <AdvanceCommitIndex line 314, col 5 to line 340, col 103 of module ccfraft>
/\ committedLogDecrease = FALSE
/\ messages = { [ mtype |-> AppendEntriesResponse,
    mterm |-> 2,
    msource |-> NodeTwo,
    mdest |-> NodeOne,
    msuccess |-> TRUE,
    mmatchIndex |-> 2 ],
  [ mtype |-> AppendEntriesResponse,
    mterm |-> 2,
    msource |-> NodeThree,
    mdest |-> NodeOne,
    msuccess |-> TRUE,
    mmatchIndex |-> 2 ] }
/\ clientRequests = 3
/\ matchIndex = ( NodeOne :> (NodeOne :> 0 @@ NodeTwo :> 1 @@ NodeThree :> 1) @@
  NodeTwo :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 4) @@
  NodeThree :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 0) )
/\ messagesSent = ( NodeOne :>
      (NodeOne :> <<>> @@ NodeTwo :> <<1, 1>> @@ NodeThree :> <<1, 1>>) @@
  NodeTwo :> (NodeOne :> <<>> @@ NodeTwo :> <<>> @@ NodeThree :> <<1, 1>>) @@
  NodeThree :> (NodeOne :> <<>> @@ NodeTwo :> <<>> @@ NodeThree :> <<>>) )
/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1] >> @@
  NodeTwo :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2],
         [term |-> 3, contentType |-> TypeSignature, value |-> 2] >> @@
  NodeThree :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2],
         [term |-> 3, contentType |-> TypeSignature, value |-> 2] >> )
/\ state = (NodeOne :> Leader @@ NodeTwo :> Leader @@ NodeThree :> Follower)
/\ votesRequested = ( NodeOne :> (NodeOne :> 0 @@ NodeTwo :> 1 @@ NodeThree :> 0) @@
  NodeTwo :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 1) @@
  NodeThree :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 0) )
/\ commitIndex = (NodeOne :> 0 @@ NodeTwo :> 4 @@ NodeThree :> 0)
/\ currentTerm = (NodeOne :> 2 @@ NodeTwo :> 3 @@ NodeThree :> 3)
/\ committedLog = << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
   [term |-> 2, contentType |-> TypeSignature, value |-> 1],
   [term |-> 3, contentType |-> TypeEntry, value |-> 2],
   [term |-> 3, contentType |-> TypeSignature, value |-> 2] >>
/\ nextIndex = ( NodeOne :> (NodeOne :> 1 @@ NodeTwo :> 2 @@ NodeThree :> 2) @@
  NodeTwo :> (NodeOne :> 3 @@ NodeTwo :> 3 @@ NodeThree :> 5) @@
  NodeThree :> (NodeOne :> 1 @@ NodeTwo :> 1 @@ NodeThree :> 1) )
/\ votesGranted = ( NodeOne :> {NodeOne, NodeTwo} @@
  NodeTwo :> {NodeTwo, NodeThree} @@
  NodeThree :> {} )
/\ votesSent = (NodeOne :> TRUE @@ NodeTwo :> TRUE @@ NodeThree :> FALSE)
/\ votedFor = (NodeOne :> NodeOne @@ NodeTwo :> NodeTwo @@ NodeThree :> NodeTwo)

1366592039 states generated, 199817884 distinct states found, 47033354 states left on queue.
The depth of the complete state graph search is 37.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 11 and the 95th percentile is 3).
Finished in 01h 41min at (2022-05-10 15:46:38)
heidihoward commented 2 years ago

I have fixed the above issue. It was caused by a bug in the helper function CommittedTermPrefix(i, x) that was used by the invariant LeaderCompletenessInv.

Unfortunately, TLC found another invariant violation, this time in LogInv, when checking the reduced state TLA+ spec using the same setup as before. I have copied the final two states below:

Progress(39) at 2022-05-16 20:24:54: 2,451,888,619 states generated (13,195,924 s/min), 349,081,105 distinct states found (1,918,518 ds/min), 83,039,989 states left on queue.
Error: Invariant LogInv is violated.
...
State 36: <AdvanceCommitIndex line 314, col 5 to line 340, col 103 of module ccfraft>
/\ committedLogDecrease = FALSE
/\ messages = { [ mtype |-> AppendEntriesResponse,
    mterm |-> 2,
    msource |-> NodeTwo,
    mdest |-> NodeThree,
    msuccess |-> TRUE,
    mmatchIndex |-> 2 ] }
/\ clientRequests = 3
/\ matchIndex = ( NodeOne :> (NodeOne :> 0 @@ NodeTwo :> 4 @@ NodeThree :> 0) @@
  NodeTwo :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 0) @@
  NodeThree :> (NodeOne :> 2 @@ NodeTwo :> 1 @@ NodeThree :> 0) )
/\ messagesSent = ( NodeOne :> (NodeOne :> <<>> @@ NodeTwo :> <<1, 1>> @@ NodeThree :> <<>>) @@
  NodeTwo :> (NodeOne :> <<>> @@ NodeTwo :> <<>> @@ NodeThree :> <<>>) @@
  NodeThree :>
      (NodeOne :> <<1, 1>> @@ NodeTwo :> <<1, 1>> @@ NodeThree :> <<>>) )
/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2],
         [term |-> 3, contentType |-> TypeSignature, value |-> 2] >> @@
  NodeTwo :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2],
         [term |-> 3, contentType |-> TypeSignature, value |-> 2] >> @@
  NodeThree :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1] >> )
/\ state = (NodeOne :> Leader @@ NodeTwo :> Follower @@ NodeThree :> Leader)
/\ votesRequested = ( NodeOne :> (NodeOne :> 0 @@ NodeTwo :> 1 @@ NodeThree :> 0) @@
  NodeTwo :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 0) @@
  NodeThree :> (NodeOne :> 0 @@ NodeTwo :> 1 @@ NodeThree :> 0) )
/\ commitIndex = (NodeOne :> 4 @@ NodeTwo :> 0 @@ NodeThree :> 0)
/\ currentTerm = (NodeOne :> 3 @@ NodeTwo :> 3 @@ NodeThree :> 2)
/\ committedLog = << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
   [term |-> 2, contentType |-> TypeSignature, value |-> 1],
   [term |-> 3, contentType |-> TypeEntry, value |-> 2],
   [term |-> 3, contentType |-> TypeSignature, value |-> 2] >>
/\ nextIndex = ( NodeOne :> (NodeOne :> 3 @@ NodeTwo :> 5 @@ NodeThree :> 3) @@
  NodeTwo :> (NodeOne :> 1 @@ NodeTwo :> 1 @@ NodeThree :> 1) @@
  NodeThree :> (NodeOne :> 3 @@ NodeTwo :> 2 @@ NodeThree :> 1) )
/\ votesGranted = ( NodeOne :> {NodeOne, NodeTwo} @@
  NodeTwo :> {} @@
  NodeThree :> {NodeTwo, NodeThree} )
/\ votesSent = (NodeOne :> TRUE @@ NodeTwo :> FALSE @@ NodeThree :> TRUE)
/\ votedFor = (NodeOne :> NodeOne @@ NodeTwo :> NodeOne @@ NodeThree :> NodeThree)

State 37: <AdvanceCommitIndex line 314, col 5 to line 340, col 103 of module ccfraft>
/\ committedLogDecrease = TRUE
/\ messages = { [ mtype |-> AppendEntriesResponse,
    mterm |-> 2,
    msource |-> NodeTwo,
    mdest |-> NodeThree,
    msuccess |-> TRUE,
    mmatchIndex |-> 2 ] }
/\ clientRequests = 3
/\ matchIndex = ( NodeOne :> (NodeOne :> 0 @@ NodeTwo :> 4 @@ NodeThree :> 0) @@
  NodeTwo :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 0) @@
  NodeThree :> (NodeOne :> 2 @@ NodeTwo :> 1 @@ NodeThree :> 0) )
/\ messagesSent = ( NodeOne :> (NodeOne :> <<>> @@ NodeTwo :> <<1, 1>> @@ NodeThree :> <<>>) @@
  NodeTwo :> (NodeOne :> <<>> @@ NodeTwo :> <<>> @@ NodeThree :> <<>>) @@
  NodeThree :>
      (NodeOne :> <<1, 1>> @@ NodeTwo :> <<1, 1>> @@ NodeThree :> <<>>) )
/\ log = ( NodeOne :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2],
         [term |-> 3, contentType |-> TypeSignature, value |-> 2] >> @@
  NodeTwo :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1],
         [term |-> 3, contentType |-> TypeEntry, value |-> 2],
         [term |-> 3, contentType |-> TypeSignature, value |-> 2] >> @@
  NodeThree :>
      << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
         [term |-> 2, contentType |-> TypeSignature, value |-> 1] >> )
/\ state = (NodeOne :> Leader @@ NodeTwo :> Follower @@ NodeThree :> Leader)
/\ votesRequested = ( NodeOne :> (NodeOne :> 0 @@ NodeTwo :> 1 @@ NodeThree :> 0) @@
  NodeTwo :> (NodeOne :> 0 @@ NodeTwo :> 0 @@ NodeThree :> 0) @@
  NodeThree :> (NodeOne :> 0 @@ NodeTwo :> 1 @@ NodeThree :> 0) )
/\ commitIndex = (NodeOne :> 4 @@ NodeTwo :> 0 @@ NodeThree :> 2)
/\ currentTerm = (NodeOne :> 3 @@ NodeTwo :> 3 @@ NodeThree :> 2)
/\ committedLog = << [term |-> 2, contentType |-> TypeEntry, value |-> 1],
   [term |-> 2, contentType |-> TypeSignature, value |-> 1] >>
/\ nextIndex = ( NodeOne :> (NodeOne :> 3 @@ NodeTwo :> 5 @@ NodeThree :> 3) @@
  NodeTwo :> (NodeOne :> 1 @@ NodeTwo :> 1 @@ NodeThree :> 1) @@
  NodeThree :> (NodeOne :> 3 @@ NodeTwo :> 2 @@ NodeThree :> 1) )
/\ votesGranted = ( NodeOne :> {NodeOne, NodeTwo} @@
  NodeTwo :> {} @@
  NodeThree :> {NodeTwo, NodeThree} )
/\ votesSent = (NodeOne :> TRUE @@ NodeTwo :> FALSE @@ NodeThree :> TRUE)
/\ votedFor = (NodeOne :> NodeOne @@ NodeTwo :> NodeOne @@ NodeThree :> NodeThree)

2455802346 states generated, 349623333 distinct states found, 83184962 states left on queue.
The depth of the complete state graph search is 39.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 11 and the 95th percentile is 3).
Finished in 03h 04min at (2022-05-16 20:25:14)

To avoid creating too many issues in this repo, I will use this issue to track the various state invariant violations found in the reduced state TLA+ spec until TLC can complete model checking without finding any violations

achamayou commented 2 years ago

@heidihoward creating lots of issues (and closing ones that have been resolved, with a link to a successor and to the corresponding PR) is absolutely fine, please feel free to do this however you think is best.

achamayou commented 2 years ago

Looking at this trace, I think there is something wrong with the election logic, there doesn't seem to be any way node 3 should be elected considering the state of the logs.

heidihoward commented 2 years ago

Node 3 is an old leader from term 2 who has not yet stepped down. I believe the issue here was in the logic for setting commitedLogDecrease. I have patch it (https://github.com/heidihoward/CCF/commit/615d2ba72db0869d1c8567f8f0e8e9c3023fee61) and it is currently model checking.

heidihoward commented 2 years ago

This spec has now successfully completed model checking for the first time (with a reasonable state space).

The configuration used is copied below:

---------- MODULE MCraft ----------
EXTENDS ccfraft, TLC

Server_mc == {NodeOne, NodeTwo, NodeThree}

\* Limit the terms that can be reached. Needs to be set to at least 3 to
\* evaluate all relevant states. If set to only 2, the candidate_quorum
\* constraint below is too restrictive.
TermLimit == 3

\* Limit number of requests (new entires) that can be made
RequestLimit == 2

\* Limit on number of request votes that can be sent to each other node
RequestVoteLimit_mc == 1

MessagesLimit_mc == 1

mc_spec == Spec

constraint_term == \A i \in Server : currentTerm[i] <= TermLimit
\* Constraint for the request limit. clientRequests starts at 1
\*  and increments with each request, so we add 1 to the user
\*  defined request limit.
constraint_requests == clientRequests <= (RequestLimit + 1)

Symmetry == Permutations(Server_mc)

\* We made several restrictions to the state space of Raft. However since we
\* made these restrictions, Deadlocks can occur at places that Raft would in
\* real-world deployments handle graciously.
\* One example of this is if a Quorum of nodes becomes Candidate but can not
\* timeout anymore since we constrained the terms. Then, an artificial Deadlock
\* is reached. We solve this below. If TerLimit is set to any number >2, this is
\* not an issue since breadth-first search will make sure that a similar
\* situation is simulated at term==1 which results in a term increase to 2.
constraint_candidate_quorum ==
    Cardinality({ i \in Server : state[i] = Candidate}) <= 1

And the final output was as follows:

Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 3.4
  based on the actual fingerprints:  val = .25
26436983870 states generated, 2633721642 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 73.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 10 and the 95th percentile is 2).
Finished in 2d 10h at (2022-05-23 00:44:59)

I'll run some further tests and then create a PR.