Open · myronmarston opened 10 years ago
The reason this doesn't exist is because of scheduled / recurring jobs.
And the inefficiency is very manageable. As part of a recent investigation I ran some benchmarks against a snapshot of one of your production redis servers, and `peek` cost about 250 microseconds of actual redis time.
Longer polling intervals also don't hurt as much as you might think. All they affect is the amount of time jobs wait in queues, but generally not throughput. In the specific cases of the ruby and python clients, if a worker successfully pops a job, it will try popping another after it completes that job. It's only when a queue has no jobs available that the sleep interval is invoked. So for busy queues, the penalty for polling is very small.
Fortunately, most pipelines are robust against additional latency, especially for the difference between say, 10 seconds or 60 seconds. Or 5 minutes, for that matter. If you have 100 worker processes and a 5-minute polling interval, on average a worker will pop every 3 seconds. If you then dump a lot of jobs in those queues, the workers will all be busy within 5 minutes and stay busy until all the jobs are done. So all that interval will have changed is that the whole giant job batch will finish a couple minutes later than it normally would have, with 1/30th the number of pop requests.
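The arithmetic behind that claim can be checked in a couple of lines (100 workers and the 5-minute interval are the figures from the paragraph above):

```ruby
# Each idle worker polls once per interval, so pops are spread across
# the fleet: on average one pop lands every interval/workers seconds.
workers  = 100
interval = 5 * 60                     # 5-minute polling interval, in seconds
average_gap = interval.fdiv(workers)  # average seconds between pops fleet-wide
```

With these numbers `average_gap` comes out to 3.0 seconds, matching the "a worker will pop every 3 seconds" figure above.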
Can we break the current worker model? My situation is: I have jobs that need to be processed, after which some information is sent to another component via HTTP POST. We have two modes:

The sync mode: a worker picks up a set of jobs from the queue. For each job, the worker sends an HTTP request, waits for the HTTP response, and then marks the job complete. After all the jobs are finished, the worker picks up another set of jobs from the queue and repeats the previous steps.

The async mode: the worker picks up a set of jobs from the queue. For each job, the worker sends an async HTTP request without waiting for the HTTP response. After sending them all, the worker picks up another set of jobs from the queue and sends those requests too. When an HTTP response comes back, the callback from the async HTTP library marks the job complete.
The sync mode fits qless. However, we have to fork a lot of workers for high-volume jobs, so the scalability is not good. I am not sure whether the async mode fits qless, since in that case the previous jobs are not yet done when the worker picks up the next set. But you can imagine that this way we would not need to fork so many workers to handle high job volumes.

My question is: can we use the async mode with qless?
(Copied from a mailing list thread):
Hi Yan, there are a couple of options here.
The first is that you can avoid using one of the built-in qless workers. Then you'd write some code that essentially calls `pop` to get some jobs, instantiates the corresponding HTTP requests, and only calls the corresponding `complete` once the HTTP request has finished.
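A minimal sketch of that first option. The `pop`/`complete` calls mirror the qless API, but `queue` and `async_post` are injected here so the sketch stays library-agnostic -- `async_post` stands in for whatever non-blocking HTTP client you use, and the `data` keys are illustrative:

```ruby
# Pop a batch, fire async requests, and only complete each job from
# its response callback.
def dispatch_batch(queue, async_post, batch_size)
  jobs = Array(queue.pop(batch_size))
  jobs.each do |job|
    async_post.call(job.data["url"], job.data["body"]) do |response|
      # Completing here -- in the callback, not after dispatch --
      # preserves pop-run-complete semantics under async I/O.
      job.complete if response == :ok
    end
  end
  jobs.size
end
```

A driving loop would just call `dispatch_batch` repeatedly, sleeping briefly whenever it returns zero.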
That said, this sounds more like a new worker model. Where the ruby client has forking and serial workers, the python client has forking, serial, and greenlet (`Fiber`s in ruby parlance) modes. The greenlet mode is ideal for exactly your use case -- where you don't want to pay the overhead of additional processes just to get your work done.
As one data point, many of our crawlers use the python qless client, often with about 8 processes and 50 greenlets per process. Depending on how lightweight your requests are, there's no reason you couldn't use hundreds or thousands of greenlets in a single process. Though I'm less familiar with the Fiber-compatible HTTP-request libraries of ruby, I imagine that similar behavior is attainable.
Ultimately, my suggestion is to make a new worker which inherits from the forking worker, but each child process uses either a thread- or fiber-pool to run each job. That way you still get the same semantics of pop-run-complete for each job, but without the overhead you were hoping to avoid. Such a worker would be welcome in the qless repo :-)
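One way to sketch such a worker's child process, assuming a qless-style queue whose `pop` returns nil when empty. `perform` stands in for the user's job body, and exiting on nil is a simplification -- a real worker would sleep and retry instead:

```ruby
# Each thread keeps the usual pop-run-complete semantics; the
# concurrency comes from running pool_size of these loops in one
# process instead of forking pool_size processes.
def run_pool(queue, pool_size)
  pool_size.times.map do
    Thread.new do
      while (job = queue.pop)  # simplification: exit on empty queue
        job.perform            # the user-defined job body
        job.complete
      end
    end
  end.each(&:join)
end
```

A fiber-based variant would have the same shape, with a fiber scheduler in place of `Thread.new`.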
Hi Dan,
Thank you for replying. Our project is using Java, so we have threads as workers; we do not fork new processes. The scalability of the first solution is very good. However, in this case a worker (thread) can pick up many, many jobs without finishing them, and the UI provided by qless would show a lot of jobs under one worker. Also, completing a job requires passing in the worker's name, and I am not sure whether that name has to correspond to the real application worker (thread, process, etc.) or not. For example, if the real application thread (worker) gets stuck, qless needs to pass the job to another thread (worker).
Based on your suggestion, we can have a thread (worker) that pulls the jobs and dispatches all the HTTP requests from them to an async HTTP library. After dispatching, it waits for the replies from the HTTP server (callbacks from the async HTTP library) and updates the qless job statuses. This is better than the sync mode, but slightly worse than the first option in scalability.
The worker is actually essentially just a unique string identifying the entity that's doing the processing. This has been left intentionally vague so as to fit into many different paradigms. For instance, the ruby bindings use a worker name that includes a hostname and PID combination, whereas the python code treats all worker processes as being the same worker. So, as long as you pass in the same string for `complete` as what you passed in on the `pop` command that yielded the job, you'll be fine.
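The hostname-plus-PID convention mentioned above can be reproduced in a couple of lines (the exact separator the ruby gem uses may differ; this shows the idea, not the gem's literal code):

```ruby
require "socket"

# A worker name is just a stable, unique string; hostname + PID is
# one convention. What matters is passing the *same* string to pop
# and to complete.
worker_name = "#{Socket.gethostname}-#{Process.pid}"
```

In Yan's Java setup, appending a thread name or ID to this string would give per-thread workers the same uniqueness property.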
Given that you are tied to using callbacks, it seems reasonable to me to have one thread `pop` and dispatch the requests, and `complete` them in the response callback. Bear in mind that all the job rules (like heartbeating) still apply, so correct configuration will likely be important.
Thanks a lot!
In our workers the general pattern is to do the following in a loop:
This is pretty inefficient and results in steady redis traffic even when there's no work to do.
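Based on the behavior described elsewhere in this thread (pop; if a job came back, run it and pop again; sleep only when the queue was empty), the loop presumably looks roughly like this -- `perform` and the injected `queue` are illustrative, not the actual worker code:

```ruby
# The polling loop: busy queues barely sleep, but an idle worker
# still issues a pop every `interval` seconds -- the steady redis
# traffic described above.
def poll_loop(queue, interval)
  loop do                # Kernel#loop stops cleanly on StopIteration
    if (job = queue.pop)
      job.perform
      job.complete
    else
      sleep interval     # only invoked when the queue was empty
    end
  end
end
```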
I'd like to propose an alternate model:
The worker would call `sleep` (with no arg) to sleep indefinitely, until its subscriber gets a `jobs_available` message, at which time it would use `Thread#run` to wake up the worker. This would result in much less redis traffic, and would allow workers to get jobs immediately when they are put on the queue rather than waiting through the 5-second (or whatever) sleep we currently use.
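A sketch of that model with two threads. The worker half below is runnable against any queue-like object; the subscriber half is shown only as a comment because it depends on a live redis connection (with redis-rb it would be the block passed to `subscribe`):

```ruby
# The worker sleeps with no argument (i.e. forever) whenever the
# queue is empty; Thread#run from the subscriber wakes it back up.
def make_worker(queue, &run_job)
  Thread.new do
    loop do
      if (job = queue.pop)
        run_job.call(job)
      else
        sleep          # indefinite; resumes when someone calls Thread#run
      end
    end
  end
end

# Wiring it to the subscription would look roughly like:
#   redis.subscribe("jobs_available") do |on|
#     on.message { |_channel, _msg| worker.run }  # wake the sleeping worker
#   end
```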
One "gotcha" with this, though: with scheduled/recurring jobs, jobs are put on the queue when a worker calls `pop` or `peek`, causing qless to make the state of that stuff consistent. So if nothing calls `pop` or `peek`, it'll never move the scheduled job to the waiting state and never notify workers. Thus, we may want to do something like use a long sleep (rather than an indefinite one), or consider having the parent process call `peek` periodically.
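The parent-process variant could be as small as a timer loop. The 60-second period is arbitrary, and the `iterations` argument exists only so the sketch can terminate; the real thing would loop forever:

```ruby
# Periodically peek every queue so qless promotes due scheduled and
# recurring jobs even while all workers are sleeping indefinitely.
def promote_scheduled(queues, period, iterations)
  iterations.times do
    queues.each(&:peek)  # side effect in qless: moves due jobs to waiting
    sleep period
  end
end

# In the parent process this would run forever in its own thread:
#   Thread.new { loop { promote_scheduled(queues, 60, 1) } }
```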