mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

flat zookeeper layout storage and faulty jobs #773

Open vixns opened 7 years ago

vixns commented 7 years ago

I added a task to Chronos. The task was faulty (a typo in the Docker image tag), and Chronos (latest 2.x version) silently tried to launch it many times (no loop protection, no errors in Sentry).

After a while, Chronos became unstable, and after I killed it, it could no longer be started.

Each time a task launches, a new entry is added under /chronos/state/state. With a faulty job, the launches loop until the entries hit the ZooKeeper node size limits and Chronos starts to become unstable.
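To give a feel for how quickly a flat parent node can reach the limit, here is a rough back-of-the-envelope sketch. It assumes the default jute.maxbuffer of 0xfffff bytes, a 4-byte length prefix per child name in the children list response, and a hypothetical 60-byte node name; it ignores the response header overhead, so treat the result as an order of magnitude only.

```python
# Rough estimate only: the name length is hypothetical and response
# header overhead is ignored.
DEFAULT_JUTE_MAXBUFFER = 0xFFFFF  # assumed ZooKeeper default, just under 1 MiB
LENGTH_PREFIX_BYTES = 4           # assumed per-string length prefix in the response


def max_children(avg_name_bytes: int, limit: int = DEFAULT_JUTE_MAXBUFFER) -> int:
    """Approximate how many child names fit in one children-list response."""
    return limit // (avg_name_bytes + LENGTH_PREFIX_BYTES)


if __name__ == "__main__":
    # Hypothetical 60-byte task node names.
    print(max_children(60))
```

With a job that relaunches every few seconds, a budget of this size could plausibly be used up within a day.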

Marathon faced a similar issue and changed its ZooKeeper storage layout in version 1.3.

Chronos should prevent faulty jobs from looping, report launch errors to Sentry, Slack, etc., and adopt a nested storage layout.
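As a minimal sketch of what the last two requests could look like (this is not Chronos or Marathon code; `nested_path`, `LaunchGuard`, the bucket count, and the failure threshold are all hypothetical names and values chosen for illustration):

```python
import hashlib


def nested_path(root: str, task_id: str, buckets: int = 256) -> str:
    """Place a task node under a hash bucket so no single parent holds every child."""
    digest = hashlib.sha1(task_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % buckets
    return f"{root}/{bucket:02x}/{task_id}"


class LaunchGuard:
    """Stops relaunching a job after max_failures consecutive failed launches."""

    def __init__(self, max_failures: int = 5):
        self.max_failures = max_failures
        self.failures = {}

    def record_failure(self, job: str) -> bool:
        """Return True if the job may be retried, False once it should be disabled."""
        self.failures[job] = self.failures.get(job, 0) + 1
        return self.failures[job] < self.max_failures

    def record_success(self, job: str) -> None:
        """Reset the consecutive-failure counter for a job."""
        self.failures.pop(job, None)
```

With 256 buckets, each parent node holds roughly 1/256 of the entries, so the children list of any one node stays far smaller than in the flat layout. The point where `record_failure` returns False is also a natural place to send a single alert rather than one per attempt.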

For reference, to fix a bloated state:

Increase jute.maxbuffer to work around the ZooKeeper limits and retrieve the huge children list

brndnmtthws commented 7 years ago

I'm not convinced we even need to store state for individual tasks. I'll look into this and see what we can do.