riverqueue / river

Fast and reliable background jobs in Go
https://riverqueue.com

Definitively fix race in periodic job enqueuer #438

Closed · brandur closed this 2 months ago

brandur commented 2 months ago

Related to #409. I think I finally cracked it. For real this time. No workarounds, no hacks.

If you look at the "after start" tests, they look roughly like this: start the service, then add some jobs:

startService(t, svc)

svc.Add(
    &PeriodicJob{ScheduleFunc: periodicIntervalSchedule(500 * time.Millisecond), ConstructorFunc: jobConstructorFunc("periodic_job_500ms", false)},
)
svc.Add(
    &PeriodicJob{ScheduleFunc: periodicIntervalSchedule(500 * time.Millisecond), ConstructorFunc: jobConstructorFunc("periodic_job_500ms_start", false), RunOnStart: true},
)
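
For context, startService, periodicIntervalSchedule, and jobConstructorFunc are test helpers that live in the maintenance tests. As a rough sketch, the interval schedule amounts to something like the following (the exact signature here is an assumption for illustration):

// Hypothetical sketch of the test helper: returns a schedule that
// always fires `interval` after the given time.
func periodicIntervalSchedule(interval time.Duration) func(time.Time) time.Time {
    return func(t time.Time) time.Time {
        return t.Add(interval)
    }
}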

This would usually work as expected, with the enqueuer starting up, then jobs added, then those new jobs scheduled.

However, what can happen is that the "after start" jobs are actually added before the enqueuer has finished entering its run loop. So the service starts up and makes a first call to timeUntilNextRun, and because the newly added periodic jobs have not yet been assigned a first run time, timeUntilNextRun returns zero. The timer in the code below then fires immediately, running an insert iteration despite the jobs not actually being appropriate for scheduling yet:

timerUntilNextRun := time.NewTimer(s.timeUntilNextRun())

for {
        select {
        case <-timerUntilNextRun.C:
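
To see where the zero comes from, here's a minimal sketch of the pre-fix computation. Only timeUntilNextRun, periodicJobs, and nextRunAt appear in the real code; the stub types and the rest of the body are assumptions for illustration. The key point is that a newly added job's zero-valued nextRunAt is far in the past, so the computed duration clamps to zero, and time.NewTimer(0) fires immediately:

// Hypothetical stand-ins for the real types, just enough to compile.
type periodicJobStub struct{ nextRunAt time.Time }
type enqueuerStub struct{ periodicJobs []*periodicJobStub }

// Sketch of the pre-fix logic: find the earliest nextRunAt across all
// periodic jobs and return how long until then.
func (s *enqueuerStub) timeUntilNextRun() time.Duration {
    var firstNextRunAt time.Time
    for _, periodicJob := range s.periodicJobs {
        if firstNextRunAt.IsZero() || periodicJob.nextRunAt.Before(firstNextRunAt) {
            firstNextRunAt = periodicJob.nextRunAt
        }
    }

    // A job added after start but not yet scheduled has a zero
    // nextRunAt. A zero time is far in the past, so the duration
    // clamps to zero and the timer above fires immediately.
    if d := time.Until(firstNextRunAt); d > 0 {
        return d
    }
    return 0
}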

The fix is to modify timeUntilNextRun so that it ignores periodic jobs that have not yet been scheduled:

for _, periodicJob := range s.periodicJobs {
    // Jobs may have been added after service start, but before this
    // function runs for the first time. They're not scheduled properly yet,
    // but they will be soon, at which point this function will run again.
    // Skip them for now.
    if periodicJob.nextRunAt.IsZero() {
        continue
    }

Those jobs will be scheduled soon afterward, at which point timeUntilNextRun is called again and returns the expected "time until" values.
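
For completeness, here's what the whole function looks like with the skip applied, again as a sketch using the same hypothetical stubs as above (the fallback duration when nothing is scheduled yet is an assumption):

func (s *enqueuerStub) timeUntilNextRun() time.Duration {
    var firstNextRunAt time.Time
    for _, periodicJob := range s.periodicJobs {
        // The fix: unscheduled jobs no longer participate in the
        // "earliest run" comparison, so they can't drag the result
        // down to zero.
        if periodicJob.nextRunAt.IsZero() {
            continue
        }
        if firstNextRunAt.IsZero() || periodicJob.nextRunAt.Before(firstNextRunAt) {
            firstNextRunAt = periodicJob.nextRunAt
        }
    }

    // With every job still unscheduled there's nothing to wait on, so
    // sleep for a long interval instead of zero (assumed fallback).
    if firstNextRunAt.IsZero() {
        return 24 * time.Hour
    }

    return time.Until(firstNextRunAt)
}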

I'm confident enough in the fix that I've reverted most of #416 so the 5s jobs go back to 500ms.

Fixes #409.

brandur commented 2 months ago

Thx. Ugh, yeah, you're right that there's definitely something else at work here. I'm going to leave this open and convert it to a draft to help me debug further.

brandur commented 2 months ago

@bgentry I think I FINALLY cracked this one. It was a legit bug, although probably not one that would ever occur IRL. I did a complete rewrite of the PR description to describe it. Mind taking another look?