tychoish closed this 1 year ago
Merging #55 (20b69f2) into master (4d78f5c) will increase coverage by 0.25%. The diff coverage is 94.73%.
@@ Coverage Diff @@
## master #55 +/- ##
==========================================
+ Coverage 86.22% 86.48% +0.25%
==========================================
Files 8 8
Lines 530 540 +10
==========================================
+ Hits 457 467 +10
Misses 52 52
Partials 21 21
| Impacted Files | Coverage Δ | |
|---|---|---|
| quartz/scheduler.go | 93.04% <94.73%> (+0.39%) | :arrow_up: |
I've poked around this some more and have fixed some broken tests. The cause of the test failure was a pretty silly error on my part around cleaning up after a timer reset, and I think the premise of this, combined with #56, is pretty good.
@tychoish I'm trying to understand how one job can be scheduled more than once. Jobs may be selected to run out of order in edge cases, but this is not critical. Is it possible to create a unit test that reproduces the issue?
I've not been able to write a test that reproduces this, but we have metrics in our production system (not open source), and we regularly see events from our instrumented version of this: https://github.com/reugn/go-quartz/blob/master/quartz/job.go#L165
I believe what can happen is that a job starts running in a non-blocking manner here, and then we immediately re-dispatch it here. If the first copy of the job is still running (exacerbated in situations where the interval is quite short and the outdated threshold is long), the second copy can try to run not only while the long-running job is still in flight, but long before it would otherwise be due.
Does that make more sense?
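To make that failure mode concrete, here's a toy, self-contained sketch of the overlap I'm describing. It doesn't use go-quartz at all; the names, intervals, and threshold values are made up for illustration, but the dispatch condition mirrors the "run it if its next run time is within the threshold" check being discussed.

```go
// Toy reproduction of the overlap: the "scheduler" fires a job in a goroutine
// (non-blocking) and, because the next firing is considered close enough,
// dispatches it again before the first run has finished.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var running int32 // how many copies of the job are in flight

	job := func(id int) {
		n := atomic.AddInt32(&running, 1)
		if n > 1 {
			fmt.Printf("run %d: %d copies of the job are running concurrently\n", id, n)
		}
		time.Sleep(200 * time.Millisecond) // a "long" job relative to its interval
		atomic.AddInt32(&running, -1)
	}

	interval := 50 * time.Millisecond
	outdatedThreshold := 300 * time.Millisecond
	next := time.Now()

	for i := 0; i < 10; i++ {
		// The isOutdated-style check: dispatch as soon as the next run time is
		// within the threshold, which for a short interval means "immediately",
		// even if the previous run is still in flight.
		if time.Until(next) < outdatedThreshold {
			go job(i) // non-blocking dispatch
			next = next.Add(interval)
		}
		time.Sleep(10 * time.Millisecond)
	}

	time.Sleep(500 * time.Millisecond) // let in-flight runs finish
}
```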
In some ways this is the counter-proposal to the other PR: here we say "don't try to schedule a job unless it's at/past its next run time." That's a big PR, and it touches a lot of pieces to avoid dropping jobs, but I'm pretty confident that jobs will never get stuck. This is a pretty robust solution (assuming trigger times aren't in the past, I suppose?)
This would be a big change, and it would mean that jobs always run after the trigger time, whereas in the current system jobs run at the trigger time, or a bit before. This PR is more consistent and predictable at the expense of being different, I suppose.
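As a rough sketch of that dispatch rule (this is not the actual diff, and `dispatchAt` is a hypothetical helper): block until the trigger time has passed, so a job starts at or after its next run time, never before.

```go
// Sketch of "don't run until the trigger time has passed": a timer created
// with a zero or negative duration fires immediately, so past-due triggers
// still run, just late rather than early.
package main

import (
	"fmt"
	"time"
)

// dispatchAt waits until nextRunTime has passed and then runs the job once.
func dispatchAt(nextRunTime time.Time, run func()) {
	timer := time.NewTimer(time.Until(nextRunTime))
	defer timer.Stop()
	<-timer.C
	run()
}

func main() {
	next := time.Now().Add(100 * time.Millisecond)
	dispatchAt(next, func() {
		fmt.Println("started late by:", time.Since(next)) // always >= 0
	})
}
```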
On the other hand, by shortening the outdated threshold, potentially by a lot, we get much of the same benefit with far less code change, and users have more control.
Let me know what you think.
@reugn have you had time to think more about this?
@tychoish, I've merged your PR (#56) with the changes to make the outdated threshold configurable. This is a quick modification to add more control over scheduled jobs. I would hold off on this proposal until we have a confirmed use case for a unit test. At the same time, a test run can be performed using shorter periods.
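For what it's worth, one way such a test could detect the problem without depending on scheduler internals is to wrap the job, record its peak concurrency, schedule the wrapper on a short interval, and assert the peak never exceeds one. The `overlapJob` type and its methods below are hypothetical names, not part of go-quartz.

```go
// Sketch of a test helper that counts overlapping executions of a job.
package quartztest

import (
	"sync"
	"time"
)

// overlapJob counts how many copies of itself are running at once.
type overlapJob struct {
	mu      sync.Mutex
	running int
	peak    int
	runtime time.Duration // simulated duration of a single execution
}

// Execute records concurrency, then simulates work for j.runtime.
func (j *overlapJob) Execute() {
	j.mu.Lock()
	j.running++
	if j.running > j.peak {
		j.peak = j.running
	}
	j.mu.Unlock()

	time.Sleep(j.runtime)

	j.mu.Lock()
	j.running--
	j.mu.Unlock()
}

// Peak reports the maximum observed concurrency; a test scheduling this job
// every few milliseconds would assert Peak() == 1 after letting it run a while.
func (j *overlapJob) Peak() int {
	j.mu.Lock()
	defer j.mu.Unlock()
	return j.peak
}
```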
I don't believe that this is final (and many of the tests don't work yet with this configuration, which I somewhat expected). However, I wanted to submit it here as a conversation starter: the
isOutdated
function will run a job if it's going to be "due" within the next 30 seconds. We regularly run jobs with very short intervals (the one that's experiencing the most "collisions" runs every 5 seconds), so maybe a better fix is to just remove this 30-second buffer entirely. Anyway, please let me know what you think; I'm not committed to this particular patch at all.
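For reference, here is a minimal sketch of the check as described in this thread (the real isOutdated lives in quartz/scheduler.go and may be written differently): with a 30-second buffer, a job on a 5-second interval is always considered due.

```go
// Sketch of the "due within a buffer" check described above.
package main

import (
	"fmt"
	"time"
)

// isOutdatedSketch reports whether nextRunTime falls within buffer of now.
func isOutdatedSketch(nextRunTime, now time.Time, buffer time.Duration) bool {
	return nextRunTime.Before(now.Add(buffer))
}

func main() {
	now := time.Now()
	next := now.Add(5 * time.Second) // a job that runs every 5 seconds

	// With the 30-second buffer, the job is considered due 25 seconds early.
	fmt.Println(isOutdatedSketch(next, now, 30*time.Second)) // true
	// With no buffer, it is only due once the trigger time has passed.
	fmt.Println(isOutdatedSketch(next, now, 0)) // false
}
```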