vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

Most efficient time setup? #224

Open petr24 opened 7 years ago

petr24 commented 7 years ago

Hey Vaughn,

I am wondering if you know what might be the most efficient option setup I can run.

Here's a real example I tested: 200 identical jobs. Each job takes ~200ms to process, because I throttle them to a minimum of 150ms plus a 3rd-party service API call whose duration varies.

1st Test: Concurrency: 1000, Prefetch: 1000, PollInterval: 5000, Promote: 5000

2nd Test: Concurrency: 1, Prefetch: 1, PollInterval: 5000, Promote: 5000

3rd Test: Concurrency: 55, Prefetch: 55, PollInterval: 5000, Promote: 5000 (the thinking here was that it takes around ~5s to process 50 jobs, so by the time pollInterval checks again the queue will just be running out and more jobs get added, so it never waits the full 5s)

4th Test: Concurrency: 1, Prefetch: 55, PollInterval: 5000, Promote: 5000

Across these 4 tests, the total time it takes to process the jobs, including some that fail (max out retries), is ~72s.

It seems as though my concurrency and pollInterval settings are almost arbitrary. What I am aiming for is to have the queue running all the time (processing jobs) and never actually waiting 5s for ready jobs. So with 200 jobs, I'm thinking it should take ~40s instead of ~72s. If I log it out, I can see the 5s pollInterval delay.

Could I be messing something up or just misunderstanding some fundamental idea of queuing?

Thanks!

vsivsi commented 7 years ago

Hi, thanks for posting this. When using .processJobs(), it is useful to know that it is already optimized for throughput. The prefetch functionality (when used) works hard to ensure that the worker is never left waiting whenever jobs are available. The whole point of prefetching is to hide the underlying getWork() network/database latency by doing it in the background while the worker is busy with (con)current jobs.

Another useful thing to know is that the local queue doesn't wait until the pollInterval to request more jobs to work on. The pollInterval (or alternatively, doing a watch/trigger) only comes into play when the worker's queue is empty. That is, when there are no jobs being worked on or waiting to be worked on in the local processJobs queue. In that case, the pollInterval sets how often it should check in with the server to see if some jobs have shown up. (And as I mentioned, if you don't want that latency, then you can observe a query on the jobCollection and use q.trigger() to ask for work when new jobs are observed.)

So what is happening the rest of the time? Well, every time the worker function finishes with a job and invokes the callback function, the processJobs queue will automatically request more work from the server, unless all of the concurrency + prefetch queue slots are already full...
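For example, a worker sketched roughly like this (using your 3rd test's settings; callThirdPartyApi is just a placeholder for your real API call, and jc is your JobCollection instance) frees a queue slot every time cb() runs, and the queue tops itself back up in the background:

```js
// Sketch only — 'apiCall' and callThirdPartyApi are placeholders.
var q = jc.processJobs('apiCall', {
  concurrency: 55,
  prefetch: 55,
  pollInterval: 5000
}, function (job, cb) {
  callThirdPartyApi(job.data, function (err) {
    if (err) {
      job.fail('' + err);   // counts against the job's retry limit
    } else {
      job.done();
    }
    cb();  // frees this slot; more work is requested without waiting for pollInterval
  });
});
```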

So if you have a very busy worker, the pollInterval setting will make very little difference in throughput.

Likewise, the promote interval on the server doesn't matter at all for jobs that are "ready to run" as soon as they are scheduled. If you have jobs set to run in the future, or that are repeating or retrying, then it matters for latency, but not for throughput.

Anyway, with all of that out of the way, it seems that at least some of the latency you are seeing is because you are measuring from a "cold start". If the jobCollection has no jobs, and the worker has no jobs, and if for some reason a bunch of jobs are saved all at once with some kind of delay...

Then the average latency from the jobs being promoted to ready, and then noticed and obtained by the worker, will be around 5000ms plus network and database round trips (10s-100s of ms).

If you want to squeeze out every bit of this, then it can be minimized by setting the server promote cycle to something much shorter (say 500ms), eliminating polling in the processJobs() setup (by setting pollInterval: false), and then setting up an observe/trigger for the worker: https://github.com/vsivsi/meteor-job-collection#qtrigger---anywhere
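Roughly, that combination looks like this (a sketch; 'apiCall' is a placeholder job type and jc is your JobCollection instance):

```js
// Server: promote waiting jobs to "ready" every 500ms.
jc.promote(500);

// Worker: no timed polling; ask for work only when ready jobs are observed.
var q = jc.processJobs('apiCall', {
  concurrency: 55,
  prefetch: 55,
  pollInterval: false
}, function (job, cb) {
  // ... do the work, then job.done() or job.fail() ...
  job.done();
  cb();
});

jc.find({ type: 'apiCall', status: 'ready' }).observe({
  added: function () { q.trigger(); }  // request work as soon as a job becomes ready
});
```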

If you do those two things, you should see your job-start latencies begin to approach sub-second timings (assuming your network/database latencies aren't terrible). All of this is with the caveat that jobCollection/Meteor/JavaScript are all "best effort" systems with respect to timing/latency of scheduled events. These are not strict realtime systems, so there are no guarantees.

Hope that helps.

vsivsi commented 7 years ago

I should add two more thoughts:

1) It would be helpful if I could see your code. If you can post a "minimal" repo that demonstrates the timing you are seeing, with just the server/worker jobCollection code (omitting the API call), I'd be happy to take a look.

2) Could it be that you are actually saturating the worker process? "Concurrent" here is in the Node.js "async" sense of "concurrent", not in a true parallel-processing "use all cores" sense. This doesn't seem likely given what you've described, but it bears mentioning that one way to increase throughput in jobCollection is to add worker processes!
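For example (a rough sketch only; the host/port, connection options, and names like 'myJobs' and 'apiCall' are assumptions, not from this thread), a standalone Node.js worker can attach over DDP using the meteor-job npm package:

```js
// Sketch of an additional pure-Node.js worker process.
// Assumes the 'ddp' and 'meteor-job' npm packages; 'myJobs' is the
// JobCollection root name on the server, 'apiCall' the job type.
var DDP = require('ddp');
var Job = require('meteor-job');

var ddp = new DDP({ host: '127.0.0.1', port: 3000, use_ejson: true, autoReconnect: true });
Job.setDDP(ddp);

ddp.connect(function (err) {
  if (err) throw err;
  Job.processJobs('myJobs', 'apiCall', { concurrency: 55, prefetch: 55 }, function (job, cb) {
    // ... do the work ...
    job.done();
    cb();
  });
});
```

Running one or two of these alongside the Meteor server is the simplest way to get true parallelism across cores once a single worker process is saturated.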