Open pires opened 7 years ago
Seems like this should be low on the priority list. When running the etcd-mesos framework as a Marathon task, if the task dies then it's pretty trivial for Marathon to re-launch it on another node in the cluster. The risk of this approach is that there may not be resources immediately available for such a launch. This risk is mitigated, somewhat, by the failover timeout specified by the framework - the existing executors will continue to run (barring some tragedy or node maintenance activity). Leader election adds complexity and there are bigger fish to fry (that customers have asked for) before tackling this proposal.
According to the Mesos High-Availability Framework guide, a framework should run an odd number (n >= 3) of scheduler instances in order to tolerate scheduler failures.
The framework should implement leader election so that only the elected leader scheduler receives offers and manages tasks.
Note: Pay special attention to how network partitions are handled, i.e. if a partition does not contain a quorum (a strict majority, floor(n/2) + 1 instances), then no one in that partition becomes a leader.
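To make the quorum rule concrete, here is a minimal sketch of the arithmetic. The function names (`quorum_size`, `tolerated_failures`, `may_elect_leader`) are made up for illustration and are not part of Mesos or etcd-mesos; a real implementation would delegate election to something like ZooKeeper or etcd rather than counting peers by hand.

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of n scheduler instances."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many instances can be lost while a quorum remains."""
    return n - quorum_size(n)


def may_elect_leader(reachable: int, n: int) -> bool:
    """A partition may elect a leader only if it holds a quorum.

    `reachable` counts the instances in this partition, including
    the one asking the question.
    """
    return reachable >= quorum_size(n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

Since two disjoint partitions cannot both hold a strict majority, at most one side of a split can elect a leader. The sketch also shows why an odd count is suggested: going from 3 to 4 instances raises the quorum from 2 to 3 without tolerating any additional failure.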