Implement View Change optimizations

andrewjstone commented 7 years ago

Stop shipping full logs during the view change protocol, as this is very expensive.

In section 5.3 of the VRR paper, a strategy is described where the protocol is modified to avoid this problem. The solution has the goal of:

Not adding unnecessary steps to the protocol
Sending small messages

Instead of sending the whole log, replicas only send a suffix of the log (1 or a few entries) to the proposed primary in the DoViewChange message. In most cases this is enough to get the primary up to date, unless it has lagged far behind. In this case it needs to do state transfer to bring itself up to date.

This solution in the mostly up to date case works fine. It has both small messages and no new messages. However, when the proposed primary is too far behind, it requires an additional GetState and NewState request/response pair from proposed primary to a replica. This has a problem though, in that replicas in the view change state are not supposed to respond to GetState messages (per section 5.2 of the VRR paper). Therefore in order to implement this operation, we must either add a new message ViewChangeGetState or allow replicas in ViewChange state to respond to GetState messages, but only when they come from the proposed primary in the same view. I believe the latter was the actual intention of the paper, and doesn't require adding new messages to the protocol.

andrewjstone commented 7 years ago

One more issue to consider is that while transferring state, we don't want the other replicas to time out and start a new view change. This could be a problem if the state to transfer is large and it takes longer than the view_change timeout to transfer all the data. A fix to this problem is to have the proposed primary periodically send ContinueWaitForStartView messages while it is updating it's state, but before it becomes the primary. This introduces a new message to the protocol, but should be safe as it is just a keep alive.

andrewjstone commented 7 years ago

This is all going to require some more thought. In fact there may be more substantial changes to the view change protocol to make things more efficient. I'm currently thinking that going toward a more Raft like approach, where only the most up to date member (of the electing quorum) can become the leader is a more workable solution than the round robin selection done in the VRR paper. Note that any solution would still maintain both the StartViewChange and DoViewChange 2 phase system to prevent the errors discussed in section 8.2 of the VRR paper. (Note that I don't think those errors can actually occur in this implementation due to the way it utilizes TCP, but sticking with the asynchronous model of the paper allows us to open the door to a possible UDP or overlay network implementations later, and also makes formal analysis easier).

Due to the complexity of the changes, I'm going to start writing the first RFC for Haret. This RFC will be based on the template for Rust language RFCs. There isn't currently an approval process, or a team large enough to formalize one. The hope is that I can gather enough feedback from domain experts in consensus to poke holes and/or help to get this solidified.

andrewjstone commented 7 years ago

An RFC was written for this issue and is under review: https://github.com/vmware/haret/pull/98

andrewjstone commented 7 years ago

RFC 1 was rejected in favor of RFC 2 in #106

evanmcc commented 7 years ago

A pragmatic concern here. The approach in RFC 2 allows the cluster nodes to run within pretty tight memory bounds, essentially the size of the backend + a few log entries. But when a reconfiguration (or a recovery from 0) starts, we essentially lose this bound (because minimum commit is now 0, essentially), and start using monotonically more memory until the new node is able to recover. I still think that this approach is better, but this limitation does suggest a couple of potential tweaks:

prewarm state where a node can start a state sync without formally joining the cluster and affecting
backends that can be synced piecemeal (e.g. via merkle-tree based sync or similar)
guidance, either in the documentation or the in the node telemetry (probably better/more accurate) as to estimated time to memory exhaustion if a sync starts. Combined with operational knowledge of the time for node replacement, this will allow operators to design effective alerts.

andrewjstone commented 7 years ago

This was something I was thinking about at the same time as you it appears. When the RFC was written I was not considering that the min commit num could ever go backwards. Practically, it probably cannot since the log is likely garbage collected after that point. For recovery, the entire state needs to be shipped to the node anyway. Once state is checkpointed with a WAL, the most up to date state will already be on disk (in non-diskless scenarios), so this isn't a problem. The problem is that the log (in a diskless scenario) will grow from it's current min commit num until the recovery completes, as you state, and can also grow if the failing node doesn't boot back up for a while, or forever. If the log gets too large before recovery completes we can always reject new commits due to memory pressure. This is unsatisfactory though. I'm unsure of the solution here. We could always exclude recovering replicas from the min commit num selection, and continue incrementing and truncating the log. If a view change occurs after recovery and the recovered, but not quite up to date, replica is selected, and is sent logs that aren't quite full we could default to some sort of merkle exchange as you suggest, or simply not accept the DoViewChange messages and wait for the next view. The latter is certainly less complex, although I haven't really thought through all the ramifications.

For reconfiguration, we definitely want to pre-warm new replicas with a 2 phase reconfiguration (plan + commit). This probably involves some sort of administrative command where we can bring the new replicas online, and sync them to the latest min commit. At this point we freeze the min commit num and send the VRR reconfiguration request. We then start advancing the min commit again based on the configuration in the new group. This does have the same growth issue, but it shouldn't really be problematic, as pre-warmed replicas should sync quickly.

evanmcc commented 7 years ago

the special case of a KV store is not so bad, I think. if you can enable the backend to selectively admit writes based on some frontier, then you can simultaneously stream in snapshot writes and current writes (ideally in a mode that disables persistence). then backup writes are ignored once they've been overwritten, and you can have the node send a special purpose message to update its min commit without participating in the quorum. unfortunately it's much harder in the general case.

does that make sense?

On May 30, 2017 11:35 PM, "Andrew J. Stone" notifications@github.com wrote:

This was something I was thinking about at the same time as you it appears. When the RFC was written I was not considering that the min commit num could ever go backwards. Practically, it probably cannot since the log is likely garbage collected after that point. For recovery, the entire state needs to be shipped to the node anyway. Once state is checkpointed with a WAL, the most up to date state will already be on disk (in non-diskless scenarios), so this isn't a problem. The problem is that the log (in a diskless scenario) will grow from it's current min commit num until the recovery completes, as you state, and can also grow if the failing node doesn't boot back up for a while, or forever. If the log gets too large before recovery completes we can always reject new commits due to memory pressure. This is unsatisfactory though. I'm unsure of the solution here. We could always exclude recovering replicas from the min commit num selection, and continue incrementing and truncating the log. If a view change occurs after recovery and the recovered, but not quite up to date replica is selected, and is sent logs that aren't quite full we could default to some sort of merkle exchange as you suggest, or simply not accept the doViewChange messages and wait for the next view. The latter is certainly less complex, although I haven't really thought through all the ramifications.

For reconfiguration, we definitely want to pre-warm new replicas with a 2 phase reconfiguration (plan + commit). This probably involves some sort of administrative command where we can bring the new replicas online, and sync them to the latest min commit. At this point we freeze the min commit num and send the VRR reconfiguration request. We then start advancing the min commit again based on the configuration in the new group. This does have the same growth issue, but it shouldn't really be problematic, as pre-warmed replicas should sync quickly.

I don't think we need a Merkle tree

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vmware/haret/issues/93#issuecomment-305097787, or mute the thread https://github.com/notifications/unsubscribe-auth/AACYLGfOTKxbGwPtIjypMQbcuugjhFAyks5r_Qo9gaJpZM4NNdEL .

andrewjstone commented 7 years ago

Implemented in #128

vmware-archive / haret

Implement View Change optimizations #93