vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

concurrency dropping down #196

Open chemuto opened 8 years ago

chemuto commented 8 years ago

Hi Vaughn. We really like your package. We use it in most of our Meteor apps and it works great. Congratulations.

We have recently been experiencing an issue with job concurrency: it seems to be slowly dropping.

For example, our deployment runs on a 36-core instance using the cluster package. We give a certain job type a concurrency of 10, which means it can handle a maximum of 360 simultaneously running jobs. That works fine for a while, with 360 being the maximum number of running jobs of that type, but that number slowly starts to decrease. Everything else looks good, but this issue eventually causes the queue of ready jobs to grow out of control and affect the app's performance. A server restart resets the concurrency back to normal (360), but it slowly starts to drop again.
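For reference, the per-process setup is roughly along these lines (a minimal sketch; the queue name, job type, and worker body are hypothetical):

    var workers = Job.processJobs('myJobQueue', 'sendEmail',
      { concurrency: 10 },  // 10 concurrent jobs per process; 36 processes => up to 360 running jobs
      function (job, cb) {
        // ... do the actual work ...
        job.done();
        cb();
      }
    );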

Do you have any idea what could be causing this issue, or have you experienced something similar yourself?

Using Meteor 1.4.1.1 and Node v4.5.0.

Thanks in advance.

vsivsi commented 8 years ago

Hi, yes this has been a fairly common issue people have run into. Please see: https://github.com/vsivsi/meteor-job-collection/issues/167 (and others it links to)

The TL;DR is that there is probably some rarely taken path out of your "worker" function that skips invoking the callback. If that ever happens, the worker queue slot for that item stays "stuck" until you restart that worker process.

How can that happen? Well, it could be as simple as a stray return somewhere that isn't preceded by cb(), but more typically it is because a rare exception occurs within the worker function and is caught and handled outside of that context, resulting in the cb() invocation being skipped:

try {
    var workers = Job.processJobs('myJobQueue', 'sendEmail',
      function (job, cb) {
        var email = job.data;

        // If sendEmail throws, then this queue slot will get stuck

        sendEmail(email.address, email.subject, email.message,
          function (err) {
            if (err) {
              job.log("Sending failed with error: " + err,
                {level: 'warning'});
              job.fail("" + err);
            } else {
              job.done();
            }
            cb();
          }
        );
      }
    );
}
catch (e) {
    // Do something ...
}
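For completeness, the "stray return" case mentioned above might look like this (a minimal sketch; validateEmail is a hypothetical helper):

    var workers = Job.processJobs('myJobQueue', 'sendEmail',
      function (job, cb) {
        if (!validateEmail(job.data)) {
          job.fail("Invalid email data");
          return;  // Bug: returns without calling cb(), so this queue slot is stuck
        }
        // ... normal processing that ends with job.done() and cb() ...
      }
    );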

The solution is to be sure to catch and handle any possible thrown exceptions within the context of the worker function, so that you can ensure that the job either fails or succeeds and the callback is called:

    var workers = Job.processJobs('myJobQueue', 'sendEmail',
      function (job, cb) {
        var email = job.data;

        // sendEmail might throw, so catch that error here!

        try {
          sendEmail(email.address, email.subject, email.message,
            function (err) {
              if (err) {
                job.log("Sending failed with error: " + err,
                  {level: 'warning'});
                job.fail("" + err);
              } else {
                job.done();
              }
              cb();
            }
          );
        }
        catch (e) {
          job.log("Sending failed with exception: " + e,
            {level: 'error'});
          job.fail("" + e);
          cb();
        }
      }
    );

satyavh commented 7 years ago

I'm having the same issue: jobs somehow get stuck in 'waiting' and never get picked up again, eventually resulting in no jobs being picked up at all. Very frustrating; I'm doing explicit callbacks everywhere.

On topic: in your example, there is no cb() in the catch, shouldn't it be there?

vsivsi commented 7 years ago

@satyavh Right, needs a cb() in the catch clause as well. This is an easy bug to introduce!

As for your experience, this is a fundamental aspect of async programming. It's no different from using a priorityQueue in the popular async npm package (except that the queue and workers are distributed). In either case, the worker callback must always be called, or the queue will fill up and stop running new jobs.
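For comparison, the same contract with async's priorityQueue looks roughly like this (a minimal sketch; handleTask is a hypothetical work function):

    var async = require('async');

    var q = async.priorityQueue(function (task, done) {
      try {
        handleTask(task);  // hypothetical work that may throw
        done();
      } catch (e) {
        done(e);           // the slot is only released when done() is called
      }
    }, 10);                // concurrency of 10, as with a job-collection worker pool

    q.push({ id: 1 }, 1);  // push a task with priority 1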

Anyway, I'm sure that you appreciate that I can't really help with your situation without a minimal reproduction and actually seeing the offending code.

Something pretty simple to try: create a shadow job collection, and a second pool of workers, that exactly mirror your real workload, with the only difference being that the shadow workers only invoke setTimeout() or something similar. If you then create a shadow job for every real job, you will very likely see that the shadow work queue does not get jammed up (under the same number of jobs/workers) the way the real one does. That will be your evidence that something is awry in your worker code. Alternatively, if you notice problems with the shadow workers, then you will have a simple reproduction case that you can share.

Sample code for a trivial shadow work queue:

var sq = Job.processJobs('shadowQueue', 'shadowJob', function (job, cb) {
   // The shadow worker does no real work; it just completes after a delay
   function work() {
      job.done();
      cb();
   }
   setTimeout(work, 5000); // Or Meteor.setTimeout()
});