vmware-archive / haret

A strongly consistent distributed coordination system, built using proven protocols & implemented in Rust.
461 stars 18 forks source link

Deal with Total Cluster Failure in diskless log scenario #75

Open andrewjstone opened 7 years ago

andrewjstone commented 7 years ago

The Viewstamped Replication Revisited protocol used by Haret is different from both Raft and Paxos in that it doesn't require any syncing to disk to operate and tolerate a minority of failures. However, it also suffers from the fact that if a majority of replicas fail at the same time, the system becomes unrecoverable.

Utilizing snapshots, haret can minimize the amount of data loss in this scenario and allow a restart of failed replicas. Their needs to be some way to join them into the cluster manually via an admin disaster recovery protocol. etcd has good docs on this for their system.