vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

Explanation of "Running job not found" error #255

Open niftykins opened 6 years ago

niftykins commented 6 years ago

Over the last few weeks I've been getting a number of these errors, starting with just a few and growing to a whole lot now. I can't figure out why they're happening by going through the job-collection codebase, so I'm hoping someone here can shed some light on why they occur.

Example of error output:

Running job not found 3BbsFvByM73DaMqF2 6EqAaJ7AgZ5GaTpCe
Running job not found ucwofx8onQS69fidP 6EqAaJ7AgZ5GaTpCe
Running job not found yyBFwSt58srxqtHAb 6EqAaJ7AgZ5GaTpCe
Running job not found 6FwN8jzbtcHWfyqtP 6EqAaJ7AgZ5GaTpCe
Running job not found BWcCWpzh6xB4LgEvf 6EqAaJ7AgZ5GaTpCe
Running job not found feSYK2s7fYeJuyaRT 6EqAaJ7AgZ5GaTpCe
Running job not found Xt7nkNNDZ67rZLxf9 6EqAaJ7AgZ5GaTpCe
Running job not found xmx9dSAy3Q6FKwDLa 6EqAaJ7AgZ5GaTpCe
Running job not found LzsFdkGtHm5Zmb5Sj 6EqAaJ7AgZ5GaTpCe
Running job not found ucwofx8onQS69fidP 6EqAaJ7AgZ5GaTpCe
Running job not found Xt7nkNNDZ67rZLxf9 6EqAaJ7AgZ5GaTpCe
Running job not found xmx9dSAy3Q6FKwDLa 6EqAaJ7AgZ5GaTpCe
Running job not found yyBFwSt58srxqtHAb 6EqAaJ7AgZ5GaTpCe
Running job not found 3BbsFvByM73DaMqF2 6EqAaJ7AgZ5GaTpCe

Happy to provide more specific info about the setup/jobs if there isn't a typical reason for why these errors occur.

Thanks

vsivsi commented 6 years ago

"Running job not found" is logged to the console when a worker tries to run job.done() or job.fail() on a job that is no longer considered to be in the running state by the server.

Typically, the job is cancelled (or failed) on the server while it is running, and the worker then tries to complete (or fail) the job before it learns that the job is no longer running.

These warnings may be a "normal" and non-fatal consequence of this being a distributed system. If you see lots of them, they probably indicate that your worker code is not doing regular check-ins with the server via job.progress() / job.log() and checking the return value. If your worker doesn't do that (and stop running once a check-in indicates the job is no longer running), and/or if there is a race between a worker and a server cancel/fail and the server wins, you will see these messages.
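As a rough illustration of the check-in pattern described above, here is a minimal sketch. It runs against a hypothetical stand-in for the job object (makeFakeJob) so it works without Meteor. It assumes that job.progress(completed, total, callback) reports a falsy result once the server no longer considers the job running; please verify that against the version of job-collection you are using. runWorker and its finished callback are names made up for this example.

```javascript
// Hypothetical stand-in for a job-collection Job object, so the sketch runs
// without Meteor. Assumption: progress(completed, total, cb) calls
// cb(err, ok) with a falsy `ok` once the server no longer considers the job
// to be running.
function makeFakeJob(cancelAfterCheckIns) {
  let checkIns = 0;
  const calls = [];
  return {
    calls,
    progress(completed, total, cb) {
      checkIns += 1;
      cb(null, checkIns <= cancelAfterCheckIns);
    },
    done(result) { calls.push(['done', result]); },
    fail(reason) { calls.push(['fail', reason]); },
  };
}

// Process `items` one at a time, checking in with the server after each one.
function runWorker(job, items, finished) {
  let index = 0;
  function step() {
    if (index === items.length) {
      job.done({ processed: index });
      return finished('done');
    }
    index += 1; // stand-in for doing one unit of real work
    job.progress(index, items.length, (err, ok) => {
      if (err || !ok) {
        // The server says this run is over: stop working, and do NOT call
        // done() or fail(), since that is what triggers the warning.
        return finished('abandoned');
      }
      step();
    });
  }
  step();
}
```

The point of the design is that the worker treats a failed check-in as "this run no longer belongs to me" and simply stops, rather than racing the server to report a result.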

This would also happen if worker code erroneously calls job.done() / job.fail() more than once for a given run of a job.
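One way to rule out the double-call case is to wrap the job so that only the first done() / fail() for a run gets through. The sketch below is only an illustration: settleOnce is a hypothetical helper, not part of job-collection, and it relies on nothing about the job object beyond it having done() and fail() methods.

```javascript
// Hypothetical helper (not part of job-collection): returns done/fail
// wrappers that forward only the first call to the underlying job.
function settleOnce(job) {
  let settled = false;
  const guard = (name) => (...args) => {
    if (settled) return false; // a result was already reported for this run
    settled = true;
    job[name](...args);
    return true;
  };
  return { done: guard('done'), fail: guard('fail') };
}
```

A typical way to end up calling both is an error handler and a success handler that can each fire for the same run. With the wrapper, whichever fires second becomes a no-op, and its false return value gives you something to log while you track down the underlying bug.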

Another possibility: if you are using the workTimeout option of jc.processJobs() and you are seeing lots of these warnings, it probably means that you either need to increase the timeout value or your worker code needs to check in more frequently (or perhaps you are having other issues, like network connectivity problems between the worker and server).
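If the work itself has no natural place to check in, a timer-driven heartbeat is one option. The following is a sketch under a few assumptions: startHeartbeat is a hypothetical helper, checking in at one third of workTimeout is an arbitrary safety margin rather than a documented recommendation, and job.log(message, callback) is assumed to report a falsy result when the job is no longer running. The timers parameter exists only so the helper can be exercised with fake timers.

```javascript
// Hypothetical helper: check in at a fraction of workTimeout so that one
// delayed check-in does not, by itself, let the run time out.
function startHeartbeat(job, workTimeoutMs, onLost, timers = globalThis) {
  const intervalMs = Math.floor(workTimeoutMs / 3); // arbitrary safety margin
  const handle = timers.setInterval(() => {
    job.log('still working', (err, ok) => {
      if (err || !ok) {
        // Assumption: a falsy result means the server gave up on this run.
        timers.clearInterval(handle);
        onLost();
      }
    });
  }, intervalMs);
  return { intervalMs, stop: () => timers.clearInterval(handle) };
}
```

The worker would call stop() just before job.done() or job.fail(), and would abandon the work when onLost fires.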

Anyway, you should be able to troubleshoot what is happening in your case by examining the objects in the .log[] array on the affected job documents in the JobCollection. A detailed log of all successful JC state changes (except progress changes) for each run is kept there. It may be that you have some bug in your code that is triggering the warnings, or perhaps you are just seeing the "normal" effects of possible races in a distributed system (the server and worker are not in perfect sync, and when it matters the server state is the ultimate "source of truth").
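To make that log easier to read for one particular run, something like the sketch below may help. Everything in it is illustrative: historyForRun is a hypothetical helper, sampleJobDoc is made-up data with placeholder messages (not real job-collection output), and the entry fields it relies on (time, runId, level, message) are an assumption about the job document shape that you should check against your own documents.

```javascript
// Hypothetical job document. The log entry fields (time, runId, level,
// message) are an assumption about the document shape, and the messages are
// placeholders, not real job-collection output.
const sampleJobDoc = {
  _id: 'jobA',
  log: [
    { time: new Date('2020-01-01T00:00:00Z'), runId: null, level: 'info', message: 'placeholder 1' },
    { time: new Date('2020-01-01T00:00:05Z'), runId: 'runX', level: 'info', message: 'placeholder 2' },
    { time: new Date('2020-01-01T00:00:09Z'), runId: 'runX', level: 'warning', message: 'placeholder 3' },
  ],
};

// Return the log entries recorded for one run, oldest first, as text lines.
function historyForRun(jobDoc, runId) {
  return (jobDoc.log || [])
    .filter((entry) => entry.runId === runId)
    .sort((a, b) => a.time - b.time)
    .map((entry) => `${entry.time.toISOString()} [${entry.level}] ${entry.message}`);
}
```

Reading the entries for the affected run in time order should show whether the server cancelled or failed the job before the worker reported in, or whether the worker reported twice.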

Hope that helps.