vmware-archive / haret

A strongly consistent distributed coordination system, built using proven protocols & implemented in Rust.
461 stars 18 forks source link

Implement view change optimizations in RFC 2 #128

Closed andrewjstone closed 7 years ago

andrewjstone commented 7 years ago

Additionally add some other protocol fixes/enhancements:

  1. Ensure PrepareOk message for last log entry gets sent after state transfer
  2. Rebroadcast prepare if primary has not received quorum
  3. Send PrepareOk for last log entry after view change and recovery

Some notes about the above enhancements:

  1. When there are outstanding uncommitted operations, and a primary timeout occurs resend the prepare instead of a commit message to ensure it eventually commits even if no more new client requests are received.

    Track the broadcast in PrepareRequests using the pid of the primary if it wasn't the one to receive the client request, such as after a view change. On commit filter on this pid so that a client reply is not sent from the primary to itself.

    If a backup receives a Prepare message for the last operation in it's log it sends a PrepareOk so that the primary can commit in case prior PrepareOk messages were dropped.

  2. When becoming a backup, send a PrepareOk message for the last log entry to the primary. This is specified in the paper for view change, but not recovery. For recovery this allows the primary to update it's min_accepted map without having to wait for a new client request which may not come for quite a while. Note that this message can also be dropped, but as it is an optimization it is still safe.

Testing

Add a model of the VR replicas for QC tests

An abstract model of the state of a VR replica set has been created for use in quickcheck tests. The state of all the replicas is compared against the model after each operation to ensure correctness.

The primary idle timeout has been moved from the primary state into the context so that any changes to its configuration can carry across state transitions.

Add a history member to the scheduler that allows us to track messages and operations throughout tests and then only show the failing history. This is especially useful in QC tests when we only want to display the history for the shrunk test.

Fix make-devrel to use the right binary to generate configuration.

andrewjstone commented 7 years ago

@evanmcc I implemented this pretty close to the RFC (along with some other minor code cleanup), and the tests pass. I think though that in order for it to be correct I also need to send the min_commit num in RecoveryResponse messages. The reasoning is that while a recovering node will always recover to at least the min commit, it still needs to know the value in case there is a view change.

One other thing I realized is that I don't think the commit num is actually what we need. We can't actually assume that it's really a "global commit num" anyway as those nodes may not have gotten commit messages for their prepares, so a prepare at op n doesn't imply a commit at n-1. However, that doesn't really matter in the case of this code. What we are trying to do is ensure we transfer the minimum amount of log entries during view change. I believe what we care about tracking is the min_global_prepare log entry.

If a primary got PrepareOK messages from all backups for op n, then during a view change no log entries will need to be transferred. However, if we track the latest commit as n-1 and don't receive any more prepares before a view change we will always be sending 1 entry in DoViewChange and StartView messages unnecessarily.

evanmcc commented 7 years ago

you're right. the current design only really works for fifo message queues. I don't have time to look at the patch today, but I'll try to get to it before Monday. I think that you're right re: recovery. I'm still drifting back to my feeling that current commit needs to be added, at least occasionally, to the prepare_ok message, or to some other message (metadata_sync?) that could be OOB with respect to the overall protcol, and sent every so often.

andrewjstone commented 7 years ago

I don't think the current design requires fifo queues. State transfer will bring lost/out of order messages up to date and then a prepare_ok will occur. Also, the reason I don't think a current commit needs to be added to PrepareOk is because if all nodes send a PrepareOk they all have it, and even if a view change occurs, it will always commit since it can't be reordered in that scenario.