vmware-archive / haret

A strongly consistent distributed coordination system, built using proven protocols & implemented in Rust.
461 stars 18 forks source link

Optimize Reads #103

Open andrewjstone opened 7 years ago

andrewjstone commented 7 years ago

Reads don't need to go into the log. They just need to receive quorum in order to be linearizable. When a read occurs and there are no outstanding prepare messages, the data is read at the primary and a commit message is broadcast with a Read flag set. If the Read flag is set, the backups respond with a CommitOk message containing the epoch, view and commit_num. If the Read flag is not set, backups don't respond at all (normal VRR protocol). When quorum is received, the already read data is returned to the user.

If there are multiple pending prepare requests due to retries, the read is tracked with the last outstanding prepare at the primary, which has not yet gone out. When the last prepare is sent it's roundtrip counts as the read quorum as well. When a quorum of prepareOk messages is received, the commit for the latest write is made, and the data is read from the primary and returned to the client. Note however that if there is only one outstanding prepare request already in flight, the primary must wait an extra round trip before returning the read. It either waits for quorum on the retry of the prepare if it times out or the next commit or prepare request that was received after the read request.

This change also requires checking client requests to see if they are reads or writes. Right now this isn't indicated in the vr specific client request itself. It should be however, as we want to keep the VR protocol itself agnostic to substance of state machine operations. It only has to know if they are reads or writes. Note that this information doesn't need to be added to the client protocol as that would be redundant for users. It can be added when the the messages are received by the api server and converted to VR client requests.

evanmcc commented 7 years ago

What are reads?

I assume what you mean here is a operation that doesn't mutate the underlying representation of the state machine. In which case I am not totally sure that this belongs in the wire protocol, but is really something that is properly owned by the state machine being expressed. That way the client doesn't need to know anything about the properties of any individual operation, but the protocol state machine is allowed to optimize certain paths when the replicated state machine indicates that they may be safe. In erlang, you'd signal this from the implemented fsm to the implementing behavior by doing something like {non_mutating_op, UpdatedState} or even just non_mutating_op since the prior State is still valid. I'm not sure what the idiomatic way to do this in rust would be, though.

andrewjstone commented 7 years ago

@evanmcc Yes, reads are operations that don't mutate state. I agree with you that the client shouldn't have to know about the properties of a given operation. From re-reading what I wrote late last night, it appears I wasn't clear and a bit contradictory. In addition to not mutating the backend state I also don't want these operations in the log. This would unnecessarily bloat the log as it's expected for most workloads that reads vastly outnumber writes. In order for that to work the state machines must be told whether the operation is a read or a write. It can do this by either having the fsms themselves understand operations semantically or by indicating to the fsm whether the operation is a read or write. I prefer the latter as it keeps the FSMs agnositic to operations and prevents the need to update them for every new operation.

My plan is to have the API server endpoint look at the client request and when it transforms it into the VR specific message handled by the vr_fsm, it will indicate whether the request is a read or write. This will require changing the VrMsg enum a bit to add this data.

Also note that in the linked code, the conversion function proto_tree_op_to_vr_tree_op(msg) should instead use the standard rust conversion traits to make the code more idiomatic.

evanmcc commented 7 years ago

Maybe I'm confused here. The state machine that is being replicated must be updated for every new op, since it's where the different ops happen, right? What protocol the VR state machine uses to optimize or log should be entirely separated from these operations, and the operations/messages the VR state machine sends to do that replication.

As to your second point, putting that semantic burden on the client seems error-prone. Given some operation Q that used to not update the state of the replicated state machine, if we change it to have some stateful side effect we then have to make two changes, one to the code that implements the operation and one to the client code that can request it. What happens if we forget that Q is no longer a read from the client perspective, or we get a request to Q from an older version of the client?

andrewjstone commented 7 years ago

I see what's going on. We are talking about 2 different state machines. The one I was referring to that should only rely on opaque operations is the FSM implementing the VR protocol. When committed the operations themselves are then passed to the replicated state machine as described in the VR paper. This state machine must understand the operations since it is the one actually processing them. I call this the backend, and it utilizes vertree.

To re-iterate, the client should never have to know whether an operations is a read or write. The change to implement this will not require changing the Protobuf definitions of the client API at all. It only requires a change to the internal API server so that it can massage the VrMsg so that the VR FSM doesn't need to have to understand the internals of a VrApiReq that gets interpreted by the backend.