mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

Chronos doesn't re-register itself with Mesos if mesos-master restarts #480

Open dgassaway opened 9 years ago

dgassaway commented 9 years ago

Noticed I have to restart Chronos if the Mesos master restarts.

Setup to reproduce: Mesos 0.22.1 with one master and one slave; Chronos 2.3.4 (running as a service on Marathon 0.8.2, but this should work regardless). The Chronos checkpointing and failover timeout settings are the 2.3.4 defaults.

Seems like the framework should handle this and reconnect on its own (Marathon correctly stays registered as an active framework).

gkleiman commented 8 years ago

I was able to reproduce this with Chronos, Marathon, and mesos-execute.

These are all non-HTTP frameworks, and because of limitations in its design, the MesosSchedulerDriver is not aware of the Mesos master going away.

This can be fixed by moving Chronos to the new Mesos HTTP API.

dlsuzuki commented 7 years ago

I noticed that this issue isn't listed in the preliminary 2.5.0 changelog. If it's likely to miss the cut, perhaps someone in my organization could take a shot at it.

dlsuzuki commented 7 years ago

It doesn't look like Chronos 3.0.1 addresses this. Is the loss of registration not a major production issue for most users? I've implemented some external workarounds, but they're incredibly kludgy.

brndnmtthws commented 7 years ago

We'd need to add something similar to Marathon's heartbeat monitor to fix this properly.
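For illustration only, the general idea of a heartbeat monitor can be sketched as below. This is not Marathon's or Chronos's actual code: the class name `HeartbeatMonitor`, its methods, and the `on_timeout` callback are all hypothetical, and the sketch assumes the scheduler can call `heartbeat()` whenever it hears anything from the master and `check()` on a timer.

```python
import time
from typing import Callable, Optional


class HeartbeatMonitor:
    """Hypothetical sketch: track when the master was last heard from."""

    def __init__(
        self,
        timeout_s: float,
        on_timeout: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout  # e.g. restart the driver (assumption)
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen: Optional[float] = None  # None means "not monitoring"

    def activate(self) -> None:
        """Start monitoring, e.g. once the framework has registered."""
        self.last_seen = self.clock()

    def heartbeat(self) -> None:
        """Record a sign of life from the master; ignored while inactive."""
        if self.last_seen is not None:
            self.last_seen = self.clock()

    def check(self) -> bool:
        """Call periodically; fires on_timeout once if the master went silent."""
        if self.last_seen is None:
            return False
        if self.clock() - self.last_seen > self.timeout_s:
            self.last_seen = None  # deactivate until activate() is called again
            self.on_timeout()
            return True
        return False
```

In this sketch the monitor deactivates itself after firing, so the recovery action runs once per outage and monitoring resumes only after `activate()` is called again, for example after a successful re-registration.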