cluster re-seed support

spacejam commented 9 years ago

When a cluster has been livelocked for reseed-timeout seconds, the scheduler should:

1. Determine the liveness of each etcd server.
2. Determine the raft index of each live etcd server.
3. For each live server, starting with the highest raft index and ending with the lowest, try to reseed a cluster using that node. If a reseed succeeds, try to create at least cluster-size / 2 NEW slave instances, and if that succeeds, kill the other previous members.

If none of the reseeds succeed, do NOTHING. An operator needs to perform a manual backup and restore, and we don't want to kill our tasks in the meantime, since that would cause their mesos sandboxes to be rm'd.

spacejam commented 9 years ago

scheduler selection logic based on raft indices: https://github.com/mesosphere/etcd-mesos/pull/19

spacejam commented 9 years ago

mesosphere-backup / etcd-mesos