tychoish closed this 1 year ago
Merging #55 (20b69f2) into master (4d78f5c) will increase coverage by 0.25%. The diff coverage is 94.73%.
@@ Coverage Diff @@
## master #55 +/- ##
==========================================
+ Coverage 86.22% 86.48% +0.25%
==========================================
Files 8 8
Lines 530 540 +10
==========================================
+ Hits 457 467 +10
Misses 52 52
Partials 21 21
| Impacted Files | Coverage Δ | |
|---|---|---|
| quartz/scheduler.go | 93.04% <94.73%> (+0.39%) | :arrow_up: |
I've poked around this some more and have fixed some broken tests. The cause of the test failure was a pretty silly error on my part around cleaning up after a timer reset, and I think the premise of this, combined with #56, is pretty good.
@tychoish I'm trying to understand how one job can be scheduled more than once. Jobs may be selected to run out of order in edge cases, but this is not critical. Is it possible to create a unit test that reproduces the issue?
I've not been able to write a test that reproduces this, but we have metrics in our production system (not open source), and we regularly see events from our instrumented version of this: https://github.com/reugn/go-quartz/blob/master/quartz/job.go#L165
I believe what can happen is that a job starts running in a non-blocking manner here, and then we immediately re-dispatch it here. If the first copy of the job is still running (exacerbated in situations where the interval is quite short and the outdated threshold is long), the second copy can try to run not only while the long-running job is still in flight, but long before it would otherwise be due.
Does that make more sense?
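To make that failure mode concrete, here's a toy, self-contained sketch of the overlap I'm describing. It doesn't use go-quartz at all; the names, intervals, and threshold values are made up for illustration, but the dispatch condition mirrors the "run it if its next run time is within the threshold" check being discussed.

```go
// Toy reproduction of the overlap: the "scheduler" fires a job in a goroutine
// (non-blocking) and, because the next firing is considered close enough,
// dispatches it again before the first run has finished.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var running int32 // how many copies of the job are in flight

	job := func(id int) {
		n := atomic.AddInt32(&running, 1)
		if n > 1 {
			fmt.Printf("run %d: %d copies of the job are running concurrently\n", id, n)
		}
		time.Sleep(200 * time.Millisecond) // a "long" job relative to its interval
		atomic.AddInt32(&running, -1)
	}

	interval := 50 * time.Millisecond
	outdatedThreshold := 300 * time.Millisecond
	next := time.Now()

	for i := 0; i < 10; i++ {
		// The isOutdated-style check: dispatch as soon as the next run time is
		// within the threshold, which for a short interval means "immediately",
		// even if the previous run is still in flight.
		if time.Until(next) < outdatedThreshold {
			go job(i) // non-blocking dispatch
			next = next.Add(interval)
		}
		time.Sleep(10 * time.Millisecond)
	}

	time.Sleep(500 * time.Millisecond) // let in-flight runs finish
}
```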
In some ways this is the counter-proposal to the other PR: here we say "don't try to schedule a job unless it's at/past its next run time." That's a big PR, and it touches a lot of pieces to avoid dropping jobs, but I'm pretty confident that jobs will never get stuck. This is a pretty robust solution (assuming trigger times aren't in the past, I suppose?)
This would be a big change, and it would mean that jobs always run after the trigger time, whereas in the current system jobs run at the trigger time, or a bit before. This PR is more consistent and predictable at the expense of being different, I suppose.
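As a rough sketch of that dispatch rule (this is not the actual diff, and `dispatchAt` is a hypothetical helper): block until the trigger time has passed, so a job starts at or after its next run time, never before.

```go
// Sketch of "don't run until the trigger time has passed": a timer created
// with a zero or negative duration fires immediately, so past-due triggers
// still run, just late rather than early.
package main

import (
	"fmt"
	"time"
)

// dispatchAt waits until nextRunTime has passed and then runs the job once.
func dispatchAt(nextRunTime time.Time, run func()) {
	timer := time.NewTimer(time.Until(nextRunTime))
	defer timer.Stop()
	<-timer.C
	run()
}

func main() {
	next := time.Now().Add(100 * time.Millisecond)
	dispatchAt(next, func() {
		fmt.Println("started late by:", time.Since(next)) // always >= 0
	})
}
```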
On the other hand, by shortening the outdated threshold, potentially by a lot, we get much of the same benefit with far less code change, and users have more control.
Let me know what you think.
@reugn have you had time to think more about this?
@tychoish, I've merged your PR (#56) with the changes to make the outdated threshold configurable. This is a quick modification to add more control over scheduled jobs. I would hold off on this proposal until we have a confirmed use case for a unit test. At the same time, a test run can be performed using shorter periods.
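For what it's worth, one way such a test could detect the problem without depending on scheduler internals is to wrap the job, record its peak concurrency, schedule the wrapper on a short interval, and assert the peak never exceeds one. The `overlapJob` type and its methods below are hypothetical names, not part of go-quartz.

```go
// Sketch of a test helper that counts overlapping executions of a job.
package quartztest

import (
	"sync"
	"time"
)

// overlapJob counts how many copies of itself are running at once.
type overlapJob struct {
	mu      sync.Mutex
	running int
	peak    int
	runtime time.Duration // simulated duration of a single execution
}

// Execute records concurrency, then simulates work for j.runtime.
func (j *overlapJob) Execute() {
	j.mu.Lock()
	j.running++
	if j.running > j.peak {
		j.peak = j.running
	}
	j.mu.Unlock()

	time.Sleep(j.runtime)

	j.mu.Lock()
	j.running--
	j.mu.Unlock()
}

// Peak reports the maximum observed concurrency; a test scheduling this job
// every few milliseconds would assert Peak() == 1 after letting it run a while.
func (j *overlapJob) Peak() int {
	j.mu.Lock()
	defer j.mu.Unlock()
	return j.peak
}
```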
I don't believe that this is final (and many of the tests don't work yet with this configuration, which I somewhat expected). However, I wanted to submit it here as a conversation starter: the
isOutdated
function will run a job if it's going to be "due" within the next 30 seconds. We regularly run jobs with very short intervals (the one that's experiencing the most "collisions" runs every 5 seconds), so maybe a better fix is to just remove this 30-second buffer entirely. Anyway, please let me know what you think; I'm not committed to this particular patch at all.
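For reference, here is a minimal sketch of the check as described in this thread (the real isOutdated lives in quartz/scheduler.go and may be written differently): with a 30-second buffer, a job on a 5-second interval is always considered due.

```go
// Sketch of the "due within a buffer" check described above.
package main

import (
	"fmt"
	"time"
)

// isOutdatedSketch reports whether nextRunTime falls within buffer of now.
func isOutdatedSketch(nextRunTime, now time.Time, buffer time.Duration) bool {
	return nextRunTime.Before(now.Add(buffer))
}

func main() {
	now := time.Now()
	next := now.Add(5 * time.Second) // a job that runs every 5 seconds

	// With the 30-second buffer, the job is considered due 25 seconds early.
	fmt.Println(isOutdatedSketch(next, now, 30*time.Second)) // true
	// With no buffer, it is only due once the trigger time has passed.
	fmt.Println(isOutdatedSketch(next, now, 0)) // false
}
```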