vmware-archive / haret

A strongly consistent distributed coordination system, built using proven protocols & implemented in Rust.
461 stars 18 forks source link

Add support for interleaving the VR messages from multiple quickcheck operations #122

Open andrewjstone opened 7 years ago

andrewjstone commented 7 years ago

Instead of running each operation in quickcheck serially and then sending all the VR messages corresponding to that operation, allow running multiple operations in parallel and interleaving the VR messages until a desired state is reached or an error or timeout occurs.

Right now, when an Op::ViewChange is run it results in a Tick message being send to a backup node to trigger a view change. It then plays all resulting VR messages: StartViewChange, DoViewChange, and StartView until there are no more messages left to send. What I'd like to do is be able to run multiple ops and interleave these view change messages with recovery messages. If a view change is going on a recovery cannot successfully complete and must retry. When the view change then completes, retry to recover and ensure it does or the test times out and fails, or some invariant is violated. This should lead to more interesting histories.

The goal is not to do this per operation but to allow writing tests to run operations in parallel and interleave the resulting VR messages. Postconditions can be waited on to result in retries where necessary to bring the system back to a healthy state. Note that it would be handy to also allow dropping messages and simulating partitions during these tests.

The key challenge is how to enforce the determinism of VR msg interleavings so rerunning the same test results in the same order of messages being sent from and arriving at each node. Since all replicas in these tests run in a single thread it should be possible to maintain the same exact linear order of messages processed given the same random seed value. The reason for trying to maintain this order is to be able to reliably reproduce failures and replay the failing history with debugging messages turned on.