vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

After 20-30 Repeated Runs, Jobs Stay Running #177

Open mayvn10 opened 8 years ago

mayvn10 commented 8 years ago

@vsivsi Issuing a question here in reply to my Node DDP Client issue. Also, thanks for your helpful insight and solutions thus far!

Referring to this same project. The problem is not the connection, since the connection is sustained. The problem is that the last job run stays in the "running" state far longer than it should: the job completes its work successfully but never transitions to the "completed" state.

According to the docs for this repo, the best option may be to use "jc.shutdownJobServer([options], [callback])" but if there is another way, please explain.

If we need to use shutdownJobServer, where is the best place to use it? The Meteor app or the Node app?

Also, what's a good approach to detecting that a job has been running too long (e.g. 10-15 mins), executing shutdownJobServer, and then restarting the job server right away? Does this package automatically restart the server after a shutdown?

vsivsi commented 8 years ago

This is almost certainly a problem with your code, and not with the job-collection package. Every job must eventually call either job.done() or job.fail(). If it doesn't, then that "zombie" job will continue to show up as "running" even though no worker is actively working on it. Because servers can crash, network connections can drop, etc., job-collection contains functionality to "auto-fail" jobs that appear to be zombies because the worker hasn't reported any progress (or logged any events) on the job within a specified time window. See the workTimeout option to jc.processJobs().
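To illustrate the auto-fail idea described above, here is a minimal, self-contained Node.js sketch of the underlying check: a "running" job whose last progress report is older than the timeout window is presumed dead. The field names (`status`, `updated`) and the `findZombies` helper are stand-ins for illustration, not the package's internal implementation.

```javascript
// A running job whose last progress update is older than workTimeoutMs
// is treated as a zombie and would be auto-failed by the job server.
function findZombies(jobs, workTimeoutMs, now) {
  return jobs.filter(
    (job) => job.status === 'running' && now - job.updated > workTimeoutMs
  );
}

const now = Date.now();
const jobs = [
  { id: 1, status: 'running', updated: now - 5 * 1000 },          // fresh progress
  { id: 2, status: 'running', updated: now - 20 * 60 * 1000 },    // stale: zombie
  { id: 3, status: 'completed', updated: now - 60 * 60 * 1000 },  // already done
];

// With a 10-minute workTimeout, only job 2 qualifies as a zombie.
const zombies = findZombies(jobs, 10 * 60 * 1000, now);
console.log(zombies.map((j) => j.id)); // → [ 2 ]
```

The practical takeaway is that passing a workTimeout to jc.processJobs() lets the job server reclaim jobs like id 2 automatically, instead of requiring a manual shutdown/restart.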

jc.shutdownJobServer() probably shouldn't be used for this purpose. The issue here is that you appear to have a path of execution out of your worker function (perhaps that "catch" you mention) where neither job.done() nor job.fail() is called, even though work on that job has effectively ended because the worker code hit some kind of exception. You need to handle all exceptions and other errors in your worker function, then call either job.done() or job.fail(), and finally always call the callback function provided by processJobs().
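The pattern described above can be sketched as a worker function in which every exit path reports an outcome and the completion callback always fires. The `doWork` helper and the mock `job` object below are hypothetical stand-ins so the sketch runs outside Meteor, but the worker's shape matches what jc.processJobs() expects:

```javascript
// Every code path calls job.done() or job.fail(), and the callback
// always runs, so no job is ever left as a "running" zombie.
function worker(job, callback) {
  try {
    const result = doWork(job.data); // may throw
    job.done(result);
  } catch (err) {
    job.fail(String(err)); // report the failure instead of leaving a zombie
  } finally {
    callback(); // tell processJobs this worker slot is free again
  }
}

// Hypothetical unit of work that throws on bad input.
function doWork(data) {
  if (!data || !data.url) throw new Error('missing url');
  return { fetched: data.url };
}

// Minimal mock job to exercise both paths outside Meteor.
function mockJob(data) {
  const j = { data, state: 'running' };
  j.done = () => { j.state = 'completed'; };
  j.fail = () => { j.state = 'failed'; };
  return j;
}

const good = mockJob({ url: 'https://example.com' });
worker(good, () => {});
console.log(good.state); // → completed

const bad = mockJob({});
worker(bad, () => {});
console.log(bad.state); // → failed
```

The key design point is the try/catch/finally shape: the catch converts any thrown exception into job.fail(), and the finally guarantees the processJobs() callback runs even on an unexpected error.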

I obviously can't help you debug your program unless you share the code, as a complete Meteor application in its own repo. Debugging code via messages on a github issue is not productive.

mayvn10 commented 8 years ago

Thanks for the prompt reply.

Agreed, debugging code via messages is not productive.

We use job.done() in several places, and we reviewed every exception before, but you're right that we may have missed something, so we'll do another thorough pass through the app.

I'll update this after we find what we're looking for.