mesos / chronos

Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules
http://mesos.github.io/chronos/
Apache License 2.0

Automatic retry of failed scheduled job in 3.0.2 causes removal of repetition limits #814

Open dlsuzuki opened 7 years ago

dlsuzuki commented 7 years ago

We have identified a significant problem in Chronos 3.0.2 that occurs under the following conditions:

1) A scheduled job is configured with limited repetitions (e.g., R1/2017-03-13T00:01:00.000Z/PT60S)
2) The job is configured with multiple retries
3) The job fails on its initial run and Chronos requeues it for a retry

With older versions of Chronos, the job schedule changes from R1 to R0 when the job begins executing, then returns to R1 if the final retry fails. Under 3.0.2, when the first retry begins, the schedule is rewritten to start at the present time with infinite repetitions (e.g., R/2017-03-13T00:34:43.932Z/PT60S). The interval is the only part that isn't changed. As a result, the retries run normally, but Chronos also tries to run the job again every time the interval elapses. Since many of our jobs are designed to run only once, this is a major problem for us.
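
To make the difference between the two schedule strings concrete, here is a minimal Python sketch that splits a repeating-interval string of the form R[n]/start/interval into its parts and flags the case where a bounded schedule has become unbounded. It is not Chronos code: the names `Schedule`, `parse_schedule`, and `lost_repetition_limit` are hypothetical helpers made up for illustration, and the parser only handles the simple three-part form quoted in this report.

```python
import re
from typing import NamedTuple, Optional


class Schedule(NamedTuple):
    """Hypothetical container for the three parts of R[n]/start/interval."""
    repetitions: Optional[int]  # None means "R" with no count, i.e. unbounded
    start: str                  # kept as the raw timestamp text
    interval: str               # kept as the raw ISO8601 duration text


def parse_schedule(text: str) -> Schedule:
    """Split a repeating-interval string into repetitions, start, interval."""
    match = re.fullmatch(r"R(\d*)/([^/]+)/([^/]+)", text)
    if match is None:
        raise ValueError(f"not a repeating interval: {text!r}")
    count, start, interval = match.groups()
    return Schedule(int(count) if count else None, start, interval)


def lost_repetition_limit(configured: str, current: str) -> bool:
    """True when a bounded schedule has turned into an unbounded one."""
    before = parse_schedule(configured)
    after = parse_schedule(current)
    return before.repetitions is not None and after.repetitions is None


# The two schedule strings quoted in this report.
configured = "R1/2017-03-13T00:01:00.000Z/PT60S"
after_retry = "R/2017-03-13T00:34:43.932Z/PT60S"

print(parse_schedule(configured))
print(parse_schedule(after_retry))
print(lost_repetition_limit(configured, after_retry))
```

A check like this could be run against job definitions exported from the scheduler to spot one-time jobs that have silently become recurring, assuming the originally configured schedule is stored somewhere for comparison.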

I did verify that this doesn't happen to dependent jobs with retries (as expected, since there's no ISO8601 schedule entry for these jobs).

For now, we're going to disable retries on all scheduled jobs and just plan on additional manual remediation. However, if there's any known way to work around this problem I definitely want to know about it.

dlsuzuki commented 7 years ago

I've implemented an internal workaround, which is to set the interval on these one-time daily jobs to PT24000H. They get purged after five days, so this way they'll never accidentally re-run even if they get set to infinite recurrences. It's probably a good safety net to have, even if the root cause gets fixed.
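
The arithmetic behind the workaround can be checked with a short Python sketch. It assumes only the two figures given in the comment (a PT24000H interval and a five-day purge); the variable names are made up for illustration.

```python
from datetime import timedelta

# PT24000H expressed as a timedelta.
interval = timedelta(hours=24000)

# Purge window quoted in the comment above (an operational detail of that
# deployment, not something this sketch reads from Chronos).
purge_after = timedelta(days=5)

# 24000 hours / 24 hours per day = 1000 days.
interval_days = interval.days

# The job is purged long before a second run could come due.
rerun_is_prevented = interval > purge_after

print(interval_days, rerun_is_prevented)
```

Since the interval works out to 1000 days, a job that loses its repetition limit would still be purged well before its next scheduled run.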

IT18Monkey commented 5 years ago

I have the same problem. Is this bug fixed?