Open pires opened 7 years ago
A related proposal would be to include an uninstall flag/mode for the scheduler; when specified, the scheduler would not register with Mesos at all and would instead issue a teardown call, followed by removing any and all ZK state, and then promptly terminate (successfully). I'd expect this mode of execution to be modeled as a one-time job rather than a long-running service.
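To make the ordering of that uninstall mode concrete, here is a minimal Go sketch. It assumes the Mesos master's operator teardown endpoint (`POST /master/teardown` with a form-encoded `frameworkId`); the function names `newTeardownRequest` and `uninstall`, and the two callback hooks, are hypothetical and not part of this framework's code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newTeardownRequest builds (but does not send) a request for the Mesos
// master teardown endpoint. masterURL and frameworkID come from the caller;
// authentication, if the cluster requires it, is left out of this sketch.
func newTeardownRequest(masterURL, frameworkID string) (*http.Request, error) {
	body := url.Values{"frameworkId": {frameworkID}}.Encode()
	endpoint := strings.TrimRight(masterURL, "/") + "/master/teardown"
	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

// uninstall runs the one-shot sequence described above: tear the framework
// down first, then remove the stored ZK state. Both steps are passed in as
// hypothetical hooks so the ordering can be exercised without a cluster.
// A nil return means the process can exit successfully.
func uninstall(teardown, removeState func() error) error {
	if err := teardown(); err != nil {
		return fmt.Errorf("teardown: %w", err)
	}
	if err := removeState(); err != nil {
		return fmt.Errorf("remove ZK state: %w", err)
	}
	return nil
}
```

Keeping the state removal after the teardown means a failed teardown leaves the ZK entries in place, so the job can simply be retried.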
Problem
Currently, when this framework is deployed, the following may happen:
1. This framework hasn't been deployed before, hence a clean start is performed and everything just works.
2. This framework has been deployed before and the original scheduler is active, hence Mesos fails over between the original scheduler and the new scheduler, in a potentially endless loop of starting and terminating scheduler tasks. The existing etcd member tasks continue to run but are now unmanaged.
3. This framework has been deployed before, the original scheduler is missing and the original framework failover timeout hasn't been reached. In this scenario, the scheduler tries to reconcile the cluster state information and, eventually, resumes work.
4. This framework has been deployed before, the original scheduler is missing and the original framework failover timeout has been reached. In this scenario, the scheduler is promptly terminated by Mesos and the framework is no longer usable in this cluster without manual intervention, i.e. removing entries from ZooKeeper.
This is arguably a poor user experience.
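The four cases above reduce to three yes/no questions, which the following Go sketch spells out. The `DeploymentState` struct and its field names are hypothetical; how the scheduler would actually learn these facts (ZK contents, Mesos master state) is outside the sketch.

```go
package main

// Scenario identifies which of the four deployment cases applies.
type Scenario int

const (
	CleanStart        Scenario = iota + 1 // (1) never deployed before
	SchedulerConflict                     // (2) original scheduler still active
	FailoverResume                        // (3) scheduler missing, timeout not reached
	FailoverExpired                       // (4) scheduler missing, timeout reached
)

// DeploymentState is a hypothetical summary of what a starting scheduler
// would need to know in order to tell the cases apart.
type DeploymentState struct {
	PreviouslyDeployed     bool
	SchedulerActive        bool
	FailoverTimeoutReached bool
}

// Classify maps a deployment state onto one of the four cases, checking
// the conditions in the same order as the list above.
func Classify(s DeploymentState) Scenario {
	switch {
	case !s.PreviouslyDeployed:
		return CleanStart
	case s.SchedulerActive:
		return SchedulerConflict
	case !s.FailoverTimeoutReached:
		return FailoverResume
	default:
		return FailoverExpired
	}
}
```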
Solution
I propose the addition of a flag, i.e. framework-deployment-strategy, that instructs the framework how it should handle (2), (3) and (4). I also propose the following possibilities:
failover - the default strategy. When executed in this mode, the scheduler will perform leader election and the leader will take over.
upgrade - since upgrading etcd may not be as simple as a rolling update of members, this would be a no-op for now.
hard - before registration, the framework invokes the termination/deletion of any framework that is running or has run in the past, through the Mesos teardown API, resulting in the termination of all tasks related to this framework execution. It's assumed the framework principal is authorized to request framework teardown.
All of the aforementioned functionality that is currently missing, e.g. scheduler leader election, is expected to be implemented within the scope of this feature.
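As a rough illustration of how such a flag could be parsed and validated, here is a sketch using Go's standard `flag` package. The flag name and the three values come from the proposal above; the `Strategy` type and `parseStrategy` helper are hypothetical, and nothing here implements the strategies themselves.

```go
package main

import (
	"flag"
	"fmt"
)

// Strategy is the value of the proposed framework-deployment-strategy flag.
type Strategy string

const (
	Failover Strategy = "failover" // default: leader election, leader takes over
	Upgrade  Strategy = "upgrade"  // no-op for now
	Hard     Strategy = "hard"     // teardown of prior framework before registration
)

// String and Set make *Strategy satisfy flag.Value, so unknown values are
// rejected at parse time instead of surfacing later in the scheduler.
func (s *Strategy) String() string { return string(*s) }

func (s *Strategy) Set(v string) error {
	switch Strategy(v) {
	case Failover, Upgrade, Hard:
		*s = Strategy(v)
		return nil
	}
	return fmt.Errorf("unknown deployment strategy %q", v)
}

// parseStrategy reads the flag from args, falling back to failover when
// the flag is absent.
func parseStrategy(args []string) (Strategy, error) {
	s := Failover
	fs := flag.NewFlagSet("scheduler", flag.ContinueOnError)
	fs.Var(&s, "framework-deployment-strategy", "one of: failover, upgrade, hard")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return s, nil
}
```

With this shape, calling `parseStrategy(os.Args[1:])` in the scheduler's entry point would yield failover unless the operator explicitly asks for upgrade or hard.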
Refs #91 #95 #106