mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

document how to run frameworks as HA given 'fail fast' ZK behaviour #517

Open air opened 9 years ago

air commented 9 years ago

See #513. Mesos and related projects deliberately exit when they lose a reliable connection to ZK or the replicated log. Rather than build in retry loops, they delegate to the Operator, who is expected to manage the process with an external supervisor.

Action: document the self-termination behaviour and provide examples of how an Operator should run Chronos to achieve HA.
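To make the "external system" idea concrete, here is a minimal sketch of a supervising loop that restarts a process whenever it exits, with exponential backoff between attempts. It is an illustration only, not how Chronos or its packages actually do it: the `supervise` function, its parameters, and the command path in the usage comment are all hypothetical. In practice an Operator would more likely rely on an existing supervisor (systemd, runit, Upstart, Monit, and so on) than on a hand-written loop.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, base_delay=1.0, max_delay=30.0,
              sleep=time.sleep):
    """Run cmd and restart it every time it exits.

    Hypothetical helper for illustration. With max_restarts=None it loops
    forever; otherwise it stops after the initial run plus max_restarts
    restarts. Returns the list of observed exit codes.
    """
    exit_codes = []
    delay = base_delay
    while True:
        started = time.monotonic()
        # Blocks until the child exits, e.g. after losing its ZK connection.
        code = subprocess.call(cmd)
        exit_codes.append(code)
        print(f"{cmd[0]} exited with code {code}", file=sys.stderr)

        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes

        # A long, healthy run resets the backoff; rapid crashes keep growing it.
        if time.monotonic() - started > max_delay:
            delay = base_delay
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Usage (assumed path, adjust to your installation):
# supervise(["/usr/bin/chronos"])
```

The backoff matters because a fail-fast process will exit again immediately if ZK is still unreachable; restarting it in a tight loop would only add load during the outage.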

Example from @brndnmtthws:

In a reference architecture, you typically run 2+ instances of Chronos, with 3-5 instances of ZK. Individual instances of Chronos may come and go.

This impacts https://mesos.github.io/chronos/docs/ and possibly also https://mesosphere.github.io/marathon/docs/high-availability.html and Mesos itself.

aphyr commented 9 years ago

While you're at it, you may want to update the Debian packages to set up a supervising process as well, and make service mesos restart idempotent, so that it does not launch multiple copies.

gkleiman commented 9 years ago

@aphyr: thanks for your feedback on the Debian packages.

I created the corresponding issues in the mesosphere/chronos-pkg repository:

air commented 9 years ago

The doc improvements should also cover (please expand on these):

Q: Should I run multiple Chronos instances?