Open luluz66 opened 4 days ago
Dragonboat guarantees that entries are ordered, as this is required by the Replicated State Machine concept. Each committed entry is assigned a monotonically increasing uint64 Entry Index. When entries are applied to the state machine, all Entry Index values are checked to ensure that ordering is strictly enforced. Please see StateMachine.setLastApplied() and StateMachine.setApplied() in internal/rsm/statemachine.go for details.
Please note that your custom IOnDiskStateMachine implementation must be consistent in how it maintains its internal state. For example, when the Open() method of your IOnDiskStateMachine is called, it returns an Entry Index to indicate the progress of the State Machine. In your case, your internal index is maintained by your own code. Are that Entry Index and your internal index consistent (i.e. did they come from the same committed Entry)?
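To illustrate what "consistent" could mean here, below is a minimal sketch in which the applied Entry Index and the session indexes are persisted in one atomic write, so the value returned by Open() always matches the session state on disk. All types and names (`memStore`, `snapshotState`, `replica`) are hypothetical stand-ins, and the method signatures are simplified; this is not Dragonboat's actual IOnDiskStateMachine interface.

```go
package main

import "fmt"

// snapshotState is everything the hypothetical store persists in ONE atomic write.
type snapshotState struct {
	AppliedIndex uint64            // Entry.Index of the last applied entry
	Sessions     map[string]uint64 // session UUID -> last applied session index
}

// copyState returns a deep copy so the in-memory and "on-disk" maps never alias.
func copyState(s snapshotState) snapshotState {
	cp := snapshotState{
		AppliedIndex: s.AppliedIndex,
		Sessions:     make(map[string]uint64, len(s.Sessions)),
	}
	for k, v := range s.Sessions {
		cp.Sessions[k] = v
	}
	return cp
}

// memStore is an in-memory stand-in for an on-disk store with atomic batch commits.
type memStore struct {
	saved snapshotState
}

// commit replaces the saved state as a single unit.
func (m *memStore) commit(s snapshotState) {
	m.saved = copyState(s)
}

// replica is a simplified, hypothetical on-disk state machine.
type replica struct {
	store *memStore
	state snapshotState
}

// Open reloads the persisted state and reports the last applied Entry Index.
// The index and the sessions come from the same commit, so they agree.
func (r *replica) Open() uint64 {
	r.state = copyState(r.store.saved)
	return r.state.AppliedIndex
}

// Update applies one entry and persists the Entry Index together with the
// session index. It rejects an entry or session index that does not move forward.
func (r *replica) Update(entryIndex uint64, uuid string, sessionIndex uint64) error {
	if entryIndex <= r.state.AppliedIndex {
		return fmt.Errorf("entry index %d is not above applied index %d",
			entryIndex, r.state.AppliedIndex)
	}
	if last, ok := r.state.Sessions[uuid]; ok && sessionIndex <= last {
		return fmt.Errorf("session %s index went backward: got %d, stored %d",
			uuid, sessionIndex, last)
	}
	r.state.Sessions[uuid] = sessionIndex
	r.state.AppliedIndex = entryIndex
	r.store.commit(r.state) // one write covers both values
	return nil
}
```

If the two values were written separately, a restart could return an older Entry Index from Open() than the one the stored session index reflects. Entries after that older index would then be applied again, and the session check would see its index go backward.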
To help debug the issue you experienced:
In the Update() method of your IOnDiskStateMachine, Entry.Index is the Dragonboat-assigned index value I mentioned above. You may want to log that value, then pay special attention to the Entry Index returned by the Open() method and to how that index value is stored internally by your IOnDiskStateMachine. You may also want to double-check whether that index value is available in your Dragonboat snapshots.
Note that Dragonboat itself expects to see monotonically increasing, continuous Entry.Index values with no holes. Your IOnDiskStateMachine's Update() method will, however, see some holes: they come from entries that are not visible to your IOnDiskStateMachine, e.g. NoOP Raft entries, membership change entries, etc.
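As a rough sketch of the logging suggested above, the helper below tracks the Entry.Index values seen by Update(): it treats a gap as normal (and logs it) but reports an error if the index repeats or goes backward. `indexChecker` and `observe` are hypothetical names for illustration, not part of Dragonboat.

```go
package main

import (
	"fmt"
	"log"
)

// indexChecker remembers the last Entry.Index passed to Update().
type indexChecker struct {
	last uint64
}

// observe records one Entry.Index. Gaps are expected (entries such as NoOP or
// membership changes never reach Update), but the index must strictly increase.
func (c *indexChecker) observe(index uint64) error {
	if index <= c.last {
		return fmt.Errorf("entry index went backward: got %d after %d", index, c.last)
	}
	if gap := index - c.last - 1; gap > 0 && c.last != 0 {
		log.Printf("gap of %d entries after index %d (not visible to the state machine)",
			gap, c.last)
	}
	c.last = index
	return nil
}
```

Seeding `last` with the value returned by Open() would also make it possible to flag an entry at or below that index as soon as it reaches Update() after a restart.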
Dragonboat version
v4
We have implemented a session to ensure our IOnDiskStateMachine is idempotent. The session consists of a UUID and a monotonically increasing index. In the replica.Update() method, we check whether the index in the request session is greater than the one stored, and return an error if it goes backward.
Last week, we saw a panic caused by the session index going backward. This happened in the middle of our rollout. One machine shut down cleanly, then restarted and loaded the replica from disk. About two minutes later, it crashed with the update error described above.
Do you have any idea how this could happen? Is it guaranteed that Dragonboat keeps the entries in order?