mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

reseeding cluster for N-1 failure recovery #23

Closed spacejam closed 9 years ago

spacejam commented 9 years ago

@jdef @tsenart @karlkfi @sttts Final big feature for fault tolerance!

Tested by:

  1. starting a cluster of 3 nodes
  2. starting write and read load generators
  3. killing the scheduler
  4. killing 2 nodes while the scheduler is dead, livelocking the cluster
  5. starting the scheduler and waiting for livelockTimeout (defaults to 4 minutes of livelock before intervention)
  6. verifying that the cluster performs a reseed and returns to the desired state
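
For readers following the test steps, here is a minimal sketch of the decision they exercise: a cluster that has lost quorum but still has a survivor is treated as livelocked, and a reseed is only attempted once that state has lasted for the livelock timeout. The names `quorum`, `isLivelocked`, `shouldReseed`, and `defaultLivelockTimeout` are hypothetical and chosen for illustration. They are not taken from the etcd-mesos code, and the requirement of at least one surviving member is an assumption about how a reseed would be sourced.

```go
package main

import "time"

// defaultLivelockTimeout mirrors the 4-minute default mentioned in step 5.
// The constant name is hypothetical.
const defaultLivelockTimeout = 4 * time.Minute

// quorum returns the number of members needed for a majority
// in a cluster of the given size.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// isLivelocked reports whether the cluster has lost quorum while at
// least one member is still alive. Assumption: a reseed needs a
// surviving member to recover data from.
func isLivelocked(clusterSize, alive int) bool {
	return alive > 0 && alive < quorum(clusterSize)
}

// shouldReseed reports whether the scheduler should intervene: the
// cluster must be livelocked and must have stayed that way for at
// least the configured timeout.
func shouldReseed(clusterSize, alive int, livelockedFor, timeout time.Duration) bool {
	return isLivelocked(clusterSize, alive) && livelockedFor >= timeout
}
```

In the scenario above, a 3-node cluster with 2 nodes killed has 1 member alive against a quorum of 2, so `isLivelocked(3, 1)` is true and `shouldReseed` becomes true once the livelock has lasted the full timeout.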

Interesting observation:

spacejam commented 9 years ago

This train is about to leave! If you'd like to suggest feedback, feel free to comment on this PR post-merge and your feedback will be taken into consideration.