mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

cluster re-seed support #10

Closed spacejam closed 9 years ago

spacejam commented 9 years ago

When a cluster has been livelocked for reseed-timeout seconds, the scheduler should:

  1. determine the liveness of each etcd server
  2. determine the raft index of each live etcd server
  3. for each live server, starting with the highest raft index and ending with the lowest, try to reseed a cluster using that node. If a reseed succeeds, try to create at least cluster-size / 2 NEW slave instances, and if that also succeeds, kill the other previous members.
  4. if none of the reseeds succeed, do NOTHING. An operator needs to perform a manual backup and restore, and killing our tasks would cause their mesos sandboxes to be rm'd in the meantime.

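
The selection order in steps 1-3 can be sketched roughly as below. This is only an illustration, not the code from the linked PRs: `Node`, `ReseedFunc`, `reseedCandidates`, `tryReseed` and `errReseedFailed` are hypothetical names, and the actual liveness check, raft index lookup and reseed call are stubbed out.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Node is a hypothetical view of one etcd server as seen by the scheduler.
type Node struct {
	Name      string
	Alive     bool   // result of the liveness check (step 1)
	RaftIndex uint64 // raft index reported by the server (step 2)
}

// ReseedFunc is a hypothetical hook that tries to reseed a cluster from n.
type ReseedFunc func(n Node) error

// errReseedFailed is a placeholder error used by the stubbed reseed below.
var errReseedFailed = errors.New("reseed failed")

// reseedCandidates drops dead nodes and orders the rest from the highest
// raft index to the lowest.
func reseedCandidates(nodes []Node) []Node {
	live := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if n.Alive {
			live = append(live, n)
		}
	}
	sort.SliceStable(live, func(i, j int) bool {
		return live[i].RaftIndex > live[j].RaftIndex
	})
	return live
}

// tryReseed walks the candidates in order and returns the first node whose
// reseed succeeds. If none succeed it reports false, and the caller should
// leave every task running (step 4).
func tryReseed(nodes []Node, reseed ReseedFunc) (Node, bool) {
	for _, n := range reseedCandidates(nodes) {
		if err := reseed(n); err == nil {
			return n, true
		}
	}
	return Node{}, false
}

func main() {
	nodes := []Node{
		{Name: "etcd-1", Alive: true, RaftIndex: 40},
		{Name: "etcd-2", Alive: true, RaftIndex: 42},
		{Name: "etcd-3", Alive: false, RaftIndex: 50},
	}
	// Stubbed reseed: pretend the node with the highest live index fails.
	reseed := func(n Node) error {
		if n.Name == "etcd-2" {
			return errReseedFailed
		}
		return nil
	}
	if seed, ok := tryReseed(nodes, reseed); ok {
		fmt.Println("reseeded from", seed.Name)
	} else {
		fmt.Println("no reseed succeeded; leaving tasks running for manual recovery")
	}
}
```

Launching the new instances and killing the previous members would follow a successful `tryReseed`; that part is left out here because it depends on the scheduler's task bookkeeping.
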
spacejam commented 9 years ago

executor re-seed support: https://github.com/mesosphere/etcd-mesos/pull/18

scheduler selection logic based on raft indices: https://github.com/mesosphere/etcd-mesos/pull/19

spacejam commented 9 years ago

fully automatic reseeds: https://github.com/mesosphere/etcd-mesos/pull/23