Deal with Total Cluster Failure in diskless log scenario

The Viewstamped Replication Revisited protocol used by Haret is different from both Raft and Paxos in that it doesn't require any syncing to disk to operate and tolerate a minority of failures. However, it also suffers from the fact that if a majority of replicas fail at the same time, the system becomes unrecoverable.

Utilizing snapshots, haret can minimize the amount of data loss in this scenario and allow a restart of failed replicas. Their needs to be some way to join them into the cluster manually via an admin disaster recovery protocol. etcd has good docs on this for their system.

vmware-archive / haret

Deal with Total Cluster Failure in diskless log scenario #75