Job with retry runs once, then gets stuck as ready

luixal commented 7 years ago

Hi,

What I'm trying to do is quite complicated, but I'll try to sum it up: I have a job that inserts some docs in a collection. Each time the job runs the number of inserted docs is different based on some conditions so, for non-blocking in a long time, the job generates a fixed number of docs and the fails to be retried later.

I'm trying to get this done with an easy approach:

Jobs runs and generates 100 docs.
If it has generated all needed docs, it call job.done()
if not, it calls job.fail() so it gets retried a little later.

The problem I'm having is that job runs once and the it gets stuck in the ready status.

This is the code for the processJobs:

jobQ = jobs.processJobs(
  'scheduleServices',
  {},
  function(job, callback) {
    generateDocs();
    // if all docs have been generated:
    if (finished) {
      // marks job as done:
      job.done();
      callback();
    } else {
      console.log('job NOT completed');
      job.fail(
        {
          reason: 'Not finished',
          code: 44
        }
      );
    }
  }
);

and this is the code where I create the job:

new Job(
        jobs,
        'scheduleServices',
        {
          serviceScheduleId: doc._id
        }
      )
      .retry({wait: 1000})
      .save();

I've also set a jobs.promote(1000) for non-waiting so long between executions.

Any ideas why this can be stuck on the ready status after first run?

luixal commented 7 years ago

Also, as seen in another issue, I've tried setting an observer (although I think it shouldn't be necessay, is it?) on the job collection for jobs in a ready status, like this:

jobs.find({type: 'scheduleServices', status: 'ready'}).observe({
  added: function () {
    console.log('gotcha!');
    jobQ.trigger();
  }
});

and i get the gotcha! message printed in the console, but job keeps in ready status.

Chugot commented 7 years ago

I'd take a look at your queue capacity, if the GenerateDocs() function can break execution it may not be calling job.done or job.fail, and as such, your worker would not have capacity to run the next job even if it is in ready state.

luixal commented 7 years ago

@Chugot I've set the callback for job.fail() and it's working. Also, this is the only job I have in the queue and no exceptions or problems found at all (checking what the generateDocs() should have done, everything's fine).

Funny thing: if I just create a new job before calling job.done(), that second job runs with no problem so... there's something about the retry that I'm doing wrong. Problem is that I can end up having thousands of done jobs if I use this approach, that would kill our capacity to supervise these tasks (1 "task" = thousands of jobs).

:(

vsivsi commented 7 years ago

Hi, are you sure you are always invoking the worker callback function? In your code in the OP you only invoke it after job.done() but not after job.fail(). That would lead to the worker getting "stuck" whenever the fail path is taken through the code.

luixal commented 7 years ago

Nice catch, that was it! :)

Now everything is running great but i'm curious about it and as it's related to the issue... I've found only this about the callback() function in the docs:

callback -- must be eventually called exactly once when job.done() or job.fail() has been called on all jobs in result.

What does callback() exactly does? If you have to call it everytime after a job.done() or a job.failed() wouldn't it be easier if callback() is automatically done inside those functions?

Thank you guys :)

vsivsi commented 7 years ago

Great!

There are a bunch of reasons why job.done and job.fail don't automatically call the callback.

Firstly it is trivial to cause that to happen: both job.done and job.fail accept a callback parameter that they will invoke when finished.

Other reasons: you may want to handle errors returned by job.done and job.fail before releasing that worker slot. Also, the worker function can be configured to accept multiple jobs per invocation, each of which needs to have its own job.done or job.fail, but the callback is only invoked once per call to the worker function.

Anyway, there isn't a 1:1 mapping between async calls to the worker function and jobs, and the work of that function may not be finished as soon as job.done or job.fail are called.

luixal commented 7 years ago

Ok, so it's better to call callback() in job.done() and/or job.fail() callbacks. I'll move it there.

Thanks for your help, I'm closing this issue as it's already solved :)

vsivsi / meteor-job-collection

Job with retry runs once, then gets stuck as ready #210