vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

[Question] Can I disable concurrency when multiple worker agents are present? #178

Closed oliverkane closed 8 years ago

oliverkane commented 8 years ago

If I want to scale out my worker nodes to multiple instances, is there any built in support for preventing multiple agents from trying to execute the same job?

In other words, is this collection cluster-ready?

vsivsi commented 8 years ago

Short answer: By design, under normal conditions, multiple workers should never try to execute the same job.

Longer answer:

TL;DR: Scaling is hard, and job-collection will scale correctly to the extent that Meteor and MongoDB do (in your specific deployment).

If your workers, Meteor server(s), and MongoDB server(s) are susceptible to network partitions, then it is possible for a worker that is working on a job to become disconnected from the job-collection itself. If you have the server configured to "auto-fail" such jobs (ones that appear to be running, but whose worker is AWOL), then another worker can be allocated to that job even while the old worker continues to make progress, and the old worker may ultimately complete the job successfully some time after the network heals. In that case the job will effectively run twice, but the "old" run will not be able to complete, because the job-collection will have marked that runId as failed. This illustrates why it is important for long-running jobs to "check in" with the server periodically (by reporting progress or logging status) and to abort (or pause) processing when those attempts fail.
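As a rough illustration of that "check in and bail out" pattern (a sketch only, not a canonical recipe; the `jobCollection` variable, the 'longTask' job type, `totalSteps`, and `doStep()` are hypothetical, and the exact auto-fail behavior of `workTimeout` should be checked against the README):

```js
// Sketch: a worker that checks in with the server between steps and
// stops working on the job as soon as a check-in is rejected.
var workers = jobCollection.processJobs('longTask', {
  pollInterval: 5000,
  workTimeout: 60 * 1000   // ask the server to treat us as AWOL if we go silent
}, function (job, callback) {
  var totalSteps = job.data.totalSteps;
  var step = 0;

  function next() {
    if (step >= totalSteps) {
      job.done();
      return callback();
    }
    doStep(job.data, step);   // hypothetical unit of work
    step += 1;
    // Report progress; if the server no longer accepts updates for this runId
    // (e.g. the job was auto-failed after a partition), abort immediately,
    // because another worker may already have been handed the job.
    job.progress(step, totalSteps, function (err, ok) {
      if (err || !ok) {
        return callback();
      }
      next();
    });
  }

  next();
});
```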

The other class of problems that can occur with databases (such as MongoDB) that can be sharded and/or replicated across globally distributed data centers is that network partitions (among the MongoDB instances) can lead to consensus problems with distributed reads and writes. The consistency and atomicity of modifications to a dataset (such as a job-collection) are entirely dependent on the underlying data store doing the right thing in every single case. How MongoDB behaves under these conditions depends on which version you are using, how the DB cluster is configured, and how reads and writes are set up (in MongoDB jargon, the "concern" configuration). Meteor does a good job of hiding these issues from the developer, which makes the system simpler, but it has the potential to cause problems when scaling applications globally, depending on which settings Meteor has selected internally (which, being internal, may change). Job-collection is built on top of Meteor collections, and ultimately on whatever MongoDB instance(s) you deploy it to use, so (as with any application that uses a database) it can only be as robust as those underlying layers are configured (and tested) to be in deployment.

oliverkane commented 8 years ago

Digesting, but while I'm reading, thanks greatly for the prompt and lengthy reply!

oliverkane commented 8 years ago

This answers my question. If I understand correctly, there isn't a distributed locking mechanism, but there are features that at least attempt to avoid overlap. I can assume that under normal operation overlap won't occur, but I need to take steps to protect true mutual-exclusion zones with custom methods.

Would that be correct?

vsivsi commented 8 years ago

Right. If your app has a mission-critical process that requires certain jobs to truly execute (or more precisely, commit state) exactly once, and you want that to hold when scaling to an arbitrarily large number of geographically distributed instances, then yes, I wouldn't depend on Meteor + MongoDB to automatically achieve that level of correctness. I'm not saying it isn't possible, but you would need to seriously validate your deployment.

In such a case I would personally feel more comfortable adding a much simpler (than Meteor + MongoDB) distributed datastore to the mix to provide explicit synchronization of the committed results of a job. Something like etcd springs to mind as a good candidate under these conditions. But as always, YMMV.
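For example (purely a sketch, and assuming the third-party `etcd3` Node client, whose API should be verified against its own docs; the lock/result keys and `commitResult()` are hypothetical), the final commit step could be wrapped in an etcd-held lock so that only one worker ever publishes a given job's result:

```js
// Sketch: exactly-once commit guarded by an etcd lock (etcd3 npm client assumed).
const { Etcd3 } = require('etcd3');
const etcd = new Etcd3();   // defaults to a local etcd endpoint

async function commitExactlyOnce(jobId, result) {
  // Hold a short-lived distributed lock named after the job while committing.
  await etcd.lock(`job-commit/${jobId}`).ttl(15).do(async () => {
    // If another worker already committed this job, do nothing.
    const existing = await etcd.get(`job-result/${jobId}`).string();
    if (existing !== null) return;
    await commitResult(jobId, result);            // hypothetical: write to your store
    await etcd.put(`job-result/${jobId}`).value('done');
  });
}
```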

oliverkane commented 8 years ago

If you're familiar with Underscore's "debounce" (http://underscorejs.org/#debounce) function, my goal is to execute a job not on any particular schedule, but when it is signaled to run. The problem is that each signal need not run as its own instance of the task, since signals often arrive in large groups.

At present, I've managed to "fix" the problem by introducing a scalability issue and restricting execution to a single instance of my worker. What I really want to do is something like:

"Run this job after you stop getting signals for X time, but run anyways if no such quiescence happens after Y time".

Doing this at scale is, as you said, very difficult, since the accumulation of signals-to-be-processed is often larger than a single job.
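Something along these lines might approximate that "debounce with a maximum wait" behaviour on top of job-collection (a sketch only; `signalReceived`, the 'debouncedTask' type, `debounceMs`, and `maxWaitMs` are made-up names, and it assumes all signals funnel through a single Meteor server):

```js
// Sketch: on each signal, push the pending job back by debounceMs,
// but never past firstSignalAt + maxWaitMs.
var debounceMs = 30 * 1000;       // quiet period before the job runs
var maxWaitMs  = 5 * 60 * 1000;   // run no later than this after the first signal

function signalReceived() {
  var now = Date.now();
  var waiting = jobCollection.findOne({ type: 'debouncedTask', status: 'waiting' });

  if (waiting) {
    var firstSignalAt = waiting.data.firstSignalAt;
    if (now - firstSignalAt >= maxWaitMs) {
      return;   // cap reached: leave the pending job alone so it runs as scheduled
    }
    var delay = Math.min(debounceMs, firstSignalAt + maxWaitMs - now);
    jobCollection.cancelJobs([waiting._id]);
    new Job(jobCollection, 'debouncedTask', { firstSignalAt: firstSignalAt })
      .delay(delay)
      .save();
  } else {
    new Job(jobCollection, 'debouncedTask', { firstSignalAt: now })
      .delay(debounceMs)
      .save();
  }
}
```

Note that this only approximates the behaviour if every signal passes through one server; with several servers handling signals concurrently, the find-cancel-save sequence itself can race, which is where an external coordinator like etcd comes back in.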

I'll look into etcd. Thanks for the suggestion!