document how to run frameworks as HA given 'fail fast' ZK behaviour

mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules

http://mesos.github.io/chronos/

Apache License 2.0

document how to run frameworks as HA given 'fail fast' ZK behaviour #517

Open. air opened 9 years ago

air commented 9 years ago

See #513. Mesos and related projects deliberately exit when they lose a reliable connection to ZK or to the replicated log. Rather than build in retry loops, they delegate to the Operator, who is expected to manage the process using an external system.

Action: document the self-termination behaviour and provide examples of how an Operator should run Chronos to achieve HA.
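As a starting point for such an example, here is a minimal sketch of the kind of external supervisor the Operator would need. It is not taken from Chronos or Mesos: `supervise`, `max_restarts` and `backoff_seconds` are hypothetical names, and in production an init system or a dedicated supervisor would normally fill this role.

```python
import subprocess
import time


def supervise(cmd, max_restarts=None, backoff_seconds=1.0, sleep=time.sleep):
    """Run cmd, and start it again every time it exits.

    Returns the list of exit codes observed. max_restarts=None loops forever,
    which is what a real supervisor would do; a finite value is for testing.
    """
    exit_codes = []
    while True:
        # Block until the child exits, e.g. after it loses its ZK connection.
        exit_codes.append(subprocess.call(cmd))
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        # Pause before restarting so a persistent failure does not spin.
        sleep(backoff_seconds)


# Hypothetical usage, with the command list replaced by whatever starts
# Chronos on your hosts:
#   supervise(["/path/to/start-chronos"])
```

The supervised process stays simple (it just exits), and the restart policy lives in one place outside it.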

Example from @brndnmtthws:

In a reference architecture, you typically run 2+ instances of Chronos, with 3-5 instances of ZK. Individual instances of Chronos may come and go.
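The 3-5 figure for ZK follows from ZooKeeper needing a majority of its ensemble to be reachable. The arithmetic can be sketched as below; the function names are illustrative only.

```python
def quorum_size(ensemble_size):
    """Smallest majority of a ZooKeeper ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    # Columns: ensemble size, quorum size, tolerated failures.
    print(n, quorum_size(n), tolerated_failures(n))
```

Three servers tolerate one failure and five tolerate two. Four servers still tolerate only one, which is why odd ensemble sizes are the usual choice.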

This impacts https://mesos.github.io/chronos/docs/ and possibly also https://mesosphere.github.io/marathon/docs/high-availability.html and Mesos itself.

aphyr commented 9 years ago

While you're at it, you may want to update the Debian packages to set up a supervising process as well, and make `service mesos restart` idempotent, so that it does not launch multiple copies.
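To illustrate what idempotent could mean here, the following is a sketch of a start operation guarded by a PID file. It is an assumption about one possible approach, not how the Debian packages work: `start_once` and the PID file layout are hypothetical, and the liveness check is POSIX only.

```python
import os
import subprocess


def is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # Signal 0 checks for existence without sending a signal.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def start_once(cmd, pidfile):
    """Start cmd unless the PID recorded in pidfile is still alive.

    Returns (pid, started), where started is False if an existing
    instance was found and nothing new was launched.
    """
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
        if is_running(pid):
            return pid, False
    except (FileNotFoundError, ValueError):
        pass  # No PID file, or it is unreadable: treat as not running.
    proc = subprocess.Popen(cmd)
    with open(pidfile, "w") as f:
        f.write(str(proc.pid))
    return proc.pid, True
```

Calling `start_once` twice in a row leaves one process running rather than two. The sketch ignores PID reuse and the race between the check and the write, both of which a real init script or supervisor would have to handle.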

gkleiman commented 9 years ago

@aphyr: thanks for your feedback on the Debian packages.

I created the corresponding issues in the mesosphere/chronos-pkg repository:

air commented 9 years ago

The doc improvements should also cover (please expand on these):

Q: Should I run multiple Chronos instances?

A: IF you're using Marathon THEN No, run a single instance. Tradeoffs:
- Simplicity, no leader election.
- If Chronos terminates, no jobs are executed until Marathon has restarted the process.
- Is this also true if a network partition occurs?

A: IF you're not using Marathon THEN Yes, run multiple instances. Tradeoffs:
- Chronos leader election required.
- Reduced interruption in service.
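For the Marathon case, the single-instance setup could be expressed as a Marathon app definition along these lines. This is a sketch under assumptions: the app id, the resource sizes and the start command are placeholders, and only the basic `id`, `cmd`, `instances`, `cpus` and `mem` fields are used.

```python
import json


def chronos_app_definition(start_cmd, cpus=1.0, mem=1024):
    """Build a Marathon app definition that keeps one Chronos task running."""
    return {
        "id": "/chronos",  # hypothetical app id
        "cmd": start_cmd,  # placeholder: the actual start command for your install
        "instances": 1,    # single instance, so no Chronos leader election
        "cpus": cpus,      # assumed resource sizes
        "mem": mem,
    }


# Print the JSON body that would be submitted to Marathon.
print(json.dumps(chronos_app_definition("/path/to/start-chronos"), indent=2))
```

The resulting JSON would be submitted to Marathon's `/v2/apps` endpoint. With `instances` set to 1, Marathon plays the role of the external supervisor and restarts Chronos after it exits.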