mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

[PROPOSAL] Allow for multiple scheduler instances #121

Open pires opened 7 years ago

pires commented 7 years ago

According to the Mesos High-Availability Framework guide, a framework should run an odd number (n >= 3) of scheduler instances in order to tolerate scheduler failures.

The framework should implement leader election so that only the elected leader scheduler receives offers and manages tasks.

Note: Pay special attention to how network partitions are handled, i.e. if a partition cannot reach a quorum (a strict majority, floor(n/2)+1, of the n instances), then no one in that partition becomes a leader.
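
To make the partition rule concrete, here is a minimal sketch in Go. It assumes "quorum" means a strict majority of the configured scheduler instances, i.e. floor(n/2)+1. The names `quorum` and `canLead` are hypothetical and not part of etcd-mesos; a real implementation would more likely delegate election to ZooKeeper or etcd than count peers by hand.

```go
package main

import "fmt"

// quorum returns the smallest number of instances that forms a strict
// majority of an ensemble of n scheduler instances (hypothetical helper).
func quorum(n int) int {
	return n/2 + 1 // integer division, so this is floor(n/2)+1
}

// canLead reports whether a partition that can reach `reachable` instances
// (counting itself) out of `total` configured instances may elect a leader.
func canLead(reachable, total int) bool {
	return reachable >= quorum(total)
}

func main() {
	const total = 3 // assumed ensemble size, matching the n >= 3 suggestion
	for reachable := 1; reachable <= total; reachable++ {
		fmt.Printf("reachable=%d of %d: may elect leader = %v\n",
			reachable, total, canLead(reachable, total))
	}
}
```

With three instances, a partition that sees only itself must not elect a leader, while a partition of two may. Because at most one partition can hold a majority, two leaders can never be elected at once.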

jdef commented 7 years ago

Seems like this should be low on the priority list. When running the etcd-mesos framework as a Marathon task, if the task dies then it's pretty trivial for Marathon to re-launch it on another node in the cluster. The risk of this approach is that there may not be resources immediately available for such a launch. This risk is mitigated, somewhat, by the failover timeout specified by the framework: until that timeout expires, the existing executors will continue to run (barring some tragedy or node maintenance activity). Leader election adds complexity and there are bigger fish to fry (that customers have asked for) before tackling this proposal.