mesosphere-backup / etcd-mesos

self-healing etcd on mesos!
Apache License 2.0

[PROPOSAL] Add framework deployment collision handling strategy #122

Open pires opened 7 years ago

pires commented 7 years ago

Problem

Currently, when this framework is deployed, the following may happen:

  1. This framework hasn't been deployed before, hence a clean start is performed and everything just works.

  2. This framework has been deployed before and the original scheduler is active, hence Mesos fails over back and forth between the original scheduler and the new scheduler, in a potentially endless loop of starting and terminating scheduler tasks. The existing etcd member tasks continue to run but are now unmanaged.

  3. This framework has been deployed before, the original scheduler is missing and the original framework failover timeout hasn't been reached. In this scenario, the scheduler tries to reconcile the cluster state information and eventually resumes work.

  4. This framework has been deployed before, the original scheduler is missing and the original framework failover timeout has been reached. In this scenario, the scheduler is promptly terminated by Mesos and the framework is no longer usable in this cluster without manual intervention, i.e. removing entries from ZooKeeper.

This is arguably a poor user experience.

Solution

I propose the addition of a flag, i.e. framework-deployment-strategy, that instructs the framework how it should handle (2), (3) and (4). I also propose the following possibilities:

Any functionality mentioned above that is currently missing, e.g. scheduler leader election, is expected to be implemented within the scope of this feature.

Refs #91 #95 #106

jdef commented 7 years ago

A related proposal would be to include an uninstall flag/mode for the scheduler; when specified, the scheduler would not register with Mesos at all and would instead issue a teardown call, followed by removing any and all ZK state, and then promptly terminate (successfully). I'd expect that this mode of execution for the scheduler would be modeled as a one-time job vs a long-running service.
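A rough Go sketch of that uninstall sequence, to show the intended ordering: teardown first, then ZK cleanup, then exit. Everything named here is an assumption for illustration: the `FrameworkTeardowner` and `StateStore` interfaces, the `Uninstall` function and the fakes used in `main` are hypothetical, and no real Mesos or ZooKeeper client is wired in.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// FrameworkTeardowner is a hypothetical abstraction over whatever issues
// the teardown call to the Mesos master.
type FrameworkTeardowner interface {
	Teardown(frameworkID string) error
}

// StateStore is a hypothetical abstraction over the framework's ZK state.
type StateStore interface {
	RemoveAll() error
}

// Uninstall tears the framework down and only then removes persisted state,
// so a failed teardown leaves the state in place for a retry.
func Uninstall(frameworkID string, m FrameworkTeardowner, s StateStore) error {
	if frameworkID == "" {
		return errors.New("uninstall: no framework ID to tear down")
	}
	if err := m.Teardown(frameworkID); err != nil {
		return fmt.Errorf("uninstall: teardown failed: %w", err)
	}
	if err := s.RemoveAll(); err != nil {
		return fmt.Errorf("uninstall: removing state failed: %w", err)
	}
	return nil
}

// fakeMaster and fakeStore record calls so the ordering can be observed
// without a real cluster.
type fakeMaster struct {
	log  *[]string
	fail bool
}

func (f *fakeMaster) Teardown(id string) error {
	if f.fail {
		return errors.New("master unreachable")
	}
	*f.log = append(*f.log, "teardown "+id)
	return nil
}

type fakeStore struct{ log *[]string }

func (f *fakeStore) RemoveAll() error {
	*f.log = append(*f.log, "remove-state")
	return nil
}

func main() {
	var calls []string
	// "example-framework-id" is a placeholder, not a real framework ID.
	err := Uninstall("example-framework-id", &fakeMaster{log: &calls}, &fakeStore{log: &calls})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // a one-time job reports failure through its exit code
	}
	fmt.Println(calls)
}
```

Returning an error instead of retrying internally fits the one-time job model: whatever launches the job can decide whether to run it again.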