resque / resque-scheduler

A light-weight job scheduling system built on top of Resque
MIT License

Adding batching to re-queuing for timestamp #767

Closed brennen-stripe closed 1 year ago

brennen-stripe commented 1 year ago

During periods of high load on our system, our queue behaved poorly. In particular, our scheduled job queue was lagging behind real time: the re-scheduling was handling timestamps that were in the past, sometimes by as much as 15 minutes.

We added a metric to keep track of this lag. You can see an example chart below.

[Image: Screen Shot 2023-03-01 at 12 59 02 PM]
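For readers who want to track a similar metric, here is a minimal sketch of how the lag could be computed. It is not the code used at Stripe. The helper name `scheduler_lag_seconds` and its arguments are hypothetical, and how the oldest due timestamp is read from the scheduler is deliberately left out.

```ruby
# Hypothetical helper (not part of resque-scheduler): how far behind real
# time the scheduler is, in seconds.
#
# oldest_due_timestamp: Unix time of the earliest delayed timestamp that is
#                       still waiting to be processed, or nil if none is.
# now:                  current Unix time, injectable to make testing easy.
def scheduler_lag_seconds(oldest_due_timestamp, now = Time.now.to_i)
  return 0 if oldest_due_timestamp.nil?

  # A timestamp in the future means the scheduler is not behind at all.
  [now - oldest_due_timestamp, 0].max
end

now = Time.now.to_i
puts scheduler_lag_seconds(now - 900, now) # timestamp 15 minutes in the past
puts scheduler_lag_seconds(now + 60, now)  # timestamp not yet due
puts scheduler_lag_seconds(nil, now)       # nothing waiting
```

Reporting this value on each scheduler poll would give a chart like the one above, with spikes whenever the scheduler falls behind.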

After diving into the Resque-Scheduler code, it became apparent that the delay was stemming from inefficient queueing of scheduled jobs. This PR aims to fix that by adding batch scheduling, with a customizable batch size.
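To illustrate the general idea, here is a rough sketch of batching using an in-memory stand-in for the job store. It is not the code in this PR or resque-scheduler's implementation. `FakeDelayedStore`, `pop_batch`, and `enqueue_in_batches` are hypothetical names, and each `pop_batch` call stands in for one round trip to Redis.

```ruby
# In-memory stand-in for the list of jobs stored under one delayed timestamp.
# Hypothetical: this only counts simulated round trips, it does not use Redis.
class FakeDelayedStore
  attr_reader :round_trips

  def initialize(jobs)
    @jobs = jobs.dup
    @round_trips = 0
  end

  # Remove and return up to `count` jobs in a single simulated round trip.
  def pop_batch(count)
    @round_trips += 1
    @jobs.shift(count)
  end
end

# Drain every job due at one timestamp, `batch_size` jobs per round trip.
# Returns the jobs in the order they were taken off the store.
def enqueue_in_batches(store, batch_size: 100)
  raise ArgumentError, "batch_size must be positive" unless batch_size.positive?

  enqueued = []
  loop do
    batch = store.pop_batch(batch_size)
    break if batch.empty?

    enqueued.concat(batch) # a real scheduler would push these onto queues
  end
  enqueued
end

jobs = (1..250).map { |i| { "class" => "ExampleJob", "args" => [i] } }

one_by_one = FakeDelayedStore.new(jobs)
batched = FakeDelayedStore.new(jobs)
enqueue_in_batches(one_by_one, batch_size: 1)
enqueue_in_batches(batched, batch_size: 100)

puts "one at a time: #{one_by_one.round_trips} round trips" # 251
puts "batched:       #{batched.round_trips} round trips"    # 4
```

With 250 jobs due at the same timestamp, draining one job per call takes 251 simulated round trips (including the final empty read), while a batch size of 100 takes 4. The job order is the same in both cases.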

After sourcing my fork of this repository with this patch applied in our production environment, we have seen the delay between the scheduler and real time virtually eliminated. You can see for yourself in the chart below.

[Image: Screen Shot 2023-03-06 at 10 59 10 AM]

PatrickTulskie commented 1 year ago

Hey @brennen-stripe... can we separate out the feature/tests from changes to the test matrix in another PR? Also, can we get a PR description so we understand what this is changing? Thanks!

brennen-stripe commented 1 year ago

@PatrickTulskie Apologies, added WIP and moved this to a draft. Will be removing the matrix changes; I just wanted to sanity check some version compatibility and fix the unrelated test failures to see green builds. I see there's another PR duplicating that work, so I won't make a new one.

brennen-stripe commented 1 year ago

@PatrickTulskie Alright this should be good to go!

iloveitaly commented 1 year ago

All tests except one are passing. The one that isn't passing does pass on a previous Ruby version, and it's Ruby 2.3, so we can handle it separately.

alxckn commented 5 months ago

@brennen-stripe your initial issue with lag at Stripe seems to have been quite severe (multiple minutes on average). I'm wondering how far this fix managed to bring the average/p95 down for you (a few seconds?) and whether you are using a custom batch size higher than the default of 100?