vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection
Other
385 stars 68 forks source link

minimum repeat job interval #207

Open tzapu opened 7 years ago

tzapu commented 7 years ago

Hi,

No matter what settings i use for repeating jobs, it seems like it doesn't run faster than once every 30 seconds. Is this expected?

observer:

Queue.find({ status: 'ready' })
  .observe({
    added: () => {
      runningQueue.trigger();
    },
  });

job processor

const runningQueue = Queue.processJobs(
  Object.values(jobTypes),
  {
    concurrency: 10,
    payload: 1,
    pollInterval: 5000,
    prefetch: 1,
    workTimeout: 30 * 1000, //expire running tasks that stalled more than 1/2 min
  },
  function (job, callback) { 
...

and job scheduling

new Job(Queue,
    jobTypes.CRON_EVERY_MINUTE,
    {
    })
    .priority('normal')
    .repeat({ repeats: Queue.forever, wait: 15000 })
    .save({
      cancelRepeats: true,
    });
});

with the above it runs exactly at 30 seconds...

db snippet of two consecutive runs

"_id" : "xHHchmPZC99KxH4yw",
"runId" : "bCKqGjJ3KySBqfoMT",
"type" : "CRON_EVERY_MINUTE",
"data" : {

},
"created" : ISODate("2017-01-05T11:25:36.946+0000"),
"priority" : NumberInt(0),
"retries" : NumberInt(0),
"repeatRetries" : NumberInt(1),
"retryWait" : NumberInt(300000),
"retried" : NumberInt(1),
"retryBackoff" : "constant",
"retryUntil" : ISODate("275760-09-13T00:00:00.000+0000"),
"repeats" : 9007199254740971.0,
"repeatWait" : NumberInt(15000),
"repeated" : NumberInt(21),
"repeatUntil" : ISODate("275760-09-13T00:00:00.000+0000"),
"depends" : [

],
"resolved" : [

],
"status" : "completed",
"updated" : ISODate("2017-01-05T11:26:06.946+0000"),
"progress" : {
    "completed" : NumberInt(1), 
    "total" : NumberInt(1), 
    "percent" : NumberInt(100)
},
"log" : [
    {
        "time" : ISODate("2017-01-05T11:25:36.946+0000"), 
        "runId" : null, 
        "message" : "Rerunning job", 
        "level" : "info"
    }, 
    {
        "time" : ISODate("2017-01-05T11:26:06.927+0000"), 
        "runId" : null, 
        "message" : "Promoted to ready", 
        "level" : "info"
    }, 
    {
        "time" : ISODate("2017-01-05T11:26:06.937+0000"), 
        "runId" : "bCKqGjJ3KySBqfoMT", 
        "message" : "Job Running", 
        "level" : "info"
    }, 
    {
        "time" : ISODate("2017-01-05T11:26:06.947+0000"), 
        "runId" : "bCKqGjJ3KySBqfoMT", 
        "message" : "Job Completed", 
        "level" : "success"
    }
],
"after" : ISODate("2017-01-05T11:25:51.946+0000"),
"workTimeout" : NumberInt(30000),
"expiresAfter" : ISODate("2017-01-05T11:26:36.936+0000"),
"result" : {

}
"_id" : "Lpm9cbG68tqJvwCt9",
"runId" : "q2fuTdXphWdbyQ5XY",
"type" : "CRON_EVERY_MINUTE",
"data" : {

},
"created" : ISODate("2017-01-05T11:26:06.948+0000"),
"priority" : NumberInt(0),
"retries" : NumberInt(0),
"repeatRetries" : NumberInt(1),
"retryWait" : NumberInt(300000),
"retried" : NumberInt(1),
"retryBackoff" : "constant",
"retryUntil" : ISODate("275760-09-13T00:00:00.000+0000"),
"repeats" : 9007199254740970.0,
"repeatWait" : NumberInt(15000),
"repeated" : NumberInt(22),
"repeatUntil" : ISODate("275760-09-13T00:00:00.000+0000"),
"depends" : [

],
"resolved" : [

],
"status" : "completed",
"updated" : ISODate("2017-01-05T11:26:36.948+0000"),
"progress" : {
    "completed" : NumberInt(1), 
    "total" : NumberInt(1), 
    "percent" : NumberInt(100)
},
"log" : [
    {
        "time" : ISODate("2017-01-05T11:26:06.948+0000"), 
        "runId" : null, 
        "message" : "Rerunning job", 
        "level" : "info"
    }, 
    {
        "time" : ISODate("2017-01-05T11:26:36.934+0000"), 
        "runId" : null, 
        "message" : "Promoted to ready", 
        "level" : "info"
    }, 
    {
        "time" : ISODate("2017-01-05T11:26:36.941+0000"), 
        "runId" : "q2fuTdXphWdbyQ5XY", 
        "message" : "Job Running", 
        "level" : "info"
    }, 
    {
        "time" : ISODate("2017-01-05T11:26:36.950+0000"), 
        "runId" : "q2fuTdXphWdbyQ5XY", 
        "message" : "Job Completed", 
        "level" : "success"
    }
],
"after" : ISODate("2017-01-05T11:26:21.948+0000"),
"workTimeout" : NumberInt(30000),
"expiresAfter" : ISODate("2017-01-05T11:27:06.940+0000"),
"result" : {

}

thank you for a wonderful package

P.S. the irony that i have a job named cron every minute that i set to run at 15 seconds intervals and it actually runs at 30 seconds is not lost on me....

vsivsi commented 7 years ago

Hi, thanks for your question. The first thing to point out is that job-collection is not designed to be a "real-time" system, in that all scheduling is "best effort" with no guarantees that a given job will run exactly when requested.

With that said, there are a couple of places where latency can be squeezed out of job-collection if you care about short repeats.

The first thing to look at is the setting of jc.promote() on the server. It defaults to 15 sec. This controls how often the server "wakes up" to look for waiting jobs that can be promoted to ready and made available to workers. If your jobs are exactly in sync with this heartbeat, that would be contributing 50% of the 30sec latency you are seeing. Using something like jc.promote(1000) should help a lot.

The second thing to know is that the delay for a repeating job is calculated from the end of the previous run. So if you schedule a job to run with a job.repeat({wait:30000}) and the worker runs for 30 seconds before calling job.done() the next job will run 30 seconds later, yielding throughput of about one per minute. So you need to factor the average expected run time of each job into the wait time when you get into such short repeat cycles.

A third thing that you appear to have already done is to use jq.trigger() on an observe instead of polling for new work in your workers.

Anyway, it should definitely be possible to schedule jobs that run on average about every 15 seconds. Hope that helps!

tzapu commented 7 years ago

hi,

thanks for the in-depth answer. as i understand it, this will really only apply to scheduled jobs, so that s not such a big deal anyway.

i assume because of jc.promote and my delay of 15 seconds, it gets set at 30 seconds effectively between runs. ot's either that or jc.promote gets to run twice before my task is set to ready.

if it's the latter by any chance, and if it s not doing this anyway, would it be worth for promote to get all tasks that are ready to be set to ready and the ones that will be ready in the next interval before the next promote?

not sure if i managed to explain properly

thanks

vsivsi commented 7 years ago

If jc.promote() is the default 15s, then on average it will take 7.5s for a job whose wait time has passed to be marked as ready. Then there will perhaps be something like another second or two for the various DB accesses and network latencies for the job to be marked as ready, that change to be propagated to the observing workers, one or more of the workers to request the job (assuming one is idle) and for the job to be marked as running and returned to the worker. Similar latencies exist for a repeating job to be marked as done and a new repeat to be scheduled.

Within a fast local network, I would say that the fastest a repeating job could possibly run (assuming zero actual work, no delay scheduled between runs, and a very short promotion interval on the server) would be a couple of times per second. So it should definitely be possible to run a job approximately every 15 seconds unless the job itself takes almost that long to run, or your servers/network are slow.

tzapu commented 7 years ago

hmmm, in that case there must be something else that is happening to my job jc.promote is 15 secs job repeat is 15 secs job takes less than 1 second they run in meteor sandobx, so in meteor, same local mongo, same machine yet they happen exactly every 30 seconds appart, doesn t seem to vary

I20170113-19:42:40.139(2)? CRON
I20170113-19:43:10.140(2)? CRON
I20170113-19:43:40.143(2)? CRON
I20170113-19:44:10.167(2)? CRON
I20170113-19:44:40.153(2)? CRON
I20170113-19:45:10.158(2)? CRON
I20170113-19:45:40.156(2)? CRON
I20170113-19:46:10.183(2)? CRON
I20170113-19:46:40.170(2)? CRON
I20170113-19:47:10.174(2)? CRON
I20170113-19:47:40.226(2)? CRON
I20170113-19:48:10.204(2)? CRON
I20170113-19:48:40.188(2)? CRON
I20170113-19:49:10.196(2)? CRON
I20170113-19:49:40.197(2)? CRON
I20170113-19:50:10.231(2)? CRON
I20170113-19:50:40.209(2)? CRON
I20170113-19:51:10.209(2)? CRON
I20170113-19:51:40.261(2)? CRON
I20170113-19:52:10.240(2)? CRON
tzapu commented 7 years ago

just to add some more to this, a job scheduled to repeat every 10 secs runs exactly every 15secs

    new Job(Queue, jobTypes.USER_NOTIFICATIONS, {})
      .priority('normal')
      ////.retry({ retries: 0 })
      //.repeat({ repeats: Queue.forever, wait: 60000 })
      .repeat({ repeats: Queue.forever, wait: 15000 })
      .save({
        // Cancel any jobs of the same type,
        // but only if this job repeats forever.
        // Default: false.
        cancelRepeats: true,
      });
vsivsi commented 7 years ago

Can you point me to a minimal reproduction of this behavior? I'd be happy to take a look if you can provide a working (simple) project that demonstrates these timings.

tzapu commented 7 years ago

hi and thanks.

i will surely try, i can only get to it early next week though. will update here

cheers

vsivsi commented 7 years ago

No problem. I probably wouldn't be able to look at it until sometime next week anyway.