Throttling the publisher is problematic

jtlapp commented 8 years ago

According to the documentation, the only throttling available is via a throttle period set upon publishing an event. The throttle only allows the publisher to post one task per throttle period, refusing to post the job to the queue otherwise. I'm seeing two problems with this approach:

(1) If the server needs to throttle, it's likely because the server is currently having trouble. It needs to throttle already-queued jobs in order to recover.

(2) Suppose I have an HTTP server queuing tasks in response to client requests. The server can't block the client request until the queue stops throttling. Instead, it has to queue the job and allow the next request to arrive. If the server is busy, the client will just have to wait longer for the job to be done. The client shouldn't be told, "Sorry the server is too busy to even register your request." So the job gets queued. The workaround would be to create a job-pending queue that feeds the actual job queue; the job's gotta go into a queue.

timgit commented 8 years ago

Throttling is only for opt-in rate limiting use cases. For example, if you have a costly downstream operation, such as running a report that takes 10 minutes to crunch yesterday's sales numbers, then you can set up a throttle to make sure no more than 1 job is created for that within a certain timebox. Another use case is using a 3rd party API that charges you by usage. You could configure the throttle to be within that threshold to make sure you don't exceed your quota. Unless you have a constraint like that, you shouldn't use the throttling options because it will not always create jobs.

jtlapp commented 8 years ago

Okay, it sounds like there's a good application for throttling at publication, but how do I use pg-boss to dynamically reduce the rate at which already-queued tasks are delivered to a subscriber? How do I use pg-boss to deal with situations (1) and (2) that I describe above? Thanks!

timgit commented 8 years ago

The server should not "need" to throttle. You are in control of concurrency via teamSize, so set that according to your requirements.
Job always go to the queue unless you configure throttling to sometimes discard them. I don't think your use case is valid.

jtlapp commented 8 years ago

Okay, thanks for responding, but I'm not seeing it yet. Here's a more specific use case:

I have a server that receives requests to index the images on a particular web page. The server downloads these images to the file system and converts them to thumbnails.
My downloader process subscribes to pg-boss at startup. That's when it provides the teamSize (unless I'm not understanding something).
The server goes along quite smoothly until some major news outlet reports on it. All of a sudden, the load increases tenfold.
Some images can be 100MB in size. It takes time to download them, and I can only use so many ports. Some of these requests may take hours to get serviced.

How does my process decide that it can't keep up and needs to reduce the team size? How does it decide that it needs a team size of zero for the next 2 minutes so that it can reduce the backlog a bit before attempting any more work?

What I see happening instead is that pg-boss piles up the calls to the subscription callback, and my downloader is either forced to block on those calls because it can't service them, or it has to requeue them for service later, with a delay. Requeuing is problematic because it's asking the server to do more work at a time when it's having trouble doing work. (Requeuing isn't the only work it's doing, it's also servicing each of these tasks that have to be requeued.)

What am I missing?

timgit commented 8 years ago

Some images can be 100MB in size. It takes time to download them, and I can only use so many ports.

You'll want to set the expiration setting on the job when you submit it to make sure pg-boss doesn't expire it while downloading before you mark the job as complete. The default of 15min is probably sufficient, but you can increase that if you think you need longer to download a large image.

Some of these requests may take hours to get serviced.

What do you mean by hours? The entire length of time since a job was submitted until it's eventually completed? If so, that's just the result of the total number of jobs / number of workers (teamSize setting). You can adjust teamSize according to your desired concurrency.

The nice thing about a queue architecture like this is that it scales quite well since you can control concurrency. Because of this, you don't really need to worry about "keeping up" as you said. It doesn't matter if you have normal load or a 100-fold increase in load. From the perspective of the worker, it will always only work on the configured teamSize. If it's set too low, requests will take longer to process.

If you'd like to build another service that monitors and creates new workers, you could always do that, but it sounds like a concern that falls outside the scope of what the queue itself.

jtlapp commented 8 years ago

By hours, I mean that 50 people might have queued a request to index a page before your request arrived. I still have to queue your request, even though it's going to have to wait for all of those who got ahead of you to have their work done. I don't mean that it will take hours to download and resize an image -- possibly hours to even get started on the job.

I'm not following how I can govern the rate at which a subscriber receives tasks as a function of the current load on the server. Can you point me to anything else that uses an API similar to yours, where I might be able to get a better understanding? Thanks!

timgit commented 8 years ago

I'm going off on a tangent here with you, but one way server load could be monitored is by keeping a running total of requests within a time box. One you have that metric, you could start up another worker if you want, or register another subscriber with a specified teamSize that increases capacity to what you need.

But even if you did that, I think it may be overkill. It would be better to just set teamSize to a good level that could handle a sufficient load without overloading the resources on your server. Then, if you run out of resources on that server you could provision another server to listen for jobs as well to balance the load.

timgit commented 8 years ago

And the best way to determine your concurrency would be to actually put it under load and monitor your server with a sample workload.

jtlapp commented 8 years ago

I could look at the code to answer this question, but I'll get a more reliable answer asking you: do you wait for each worker to complete a job before handing the worker the next job? If that's the case, no wonder you're having trouble understanding what my issue is, because the issue wouldn't exist.

jtlapp commented 8 years ago

(Somehow that approach seems at odds with newJobCheckInterval -- unless this configuration option determines how often the queue polls the database and not how often jobs are delivered to workers. Presumably, then you could retrieve multiple jobs on each check interval.)

timgit commented 8 years ago

When a job is found, pg-boss calls your callback in subscribe(). In terms of a workflow, the entire cycle is complete at that point. Once that is done, the worker is free to find another job and call another callback.

timgit commented 8 years ago

Yes, newJobCheckInterval is how often pg-boss queries the database looking for work

jtlapp commented 8 years ago

Oh. So the client calls subscribe once for each job? A call to subscribe delivers a single job, no more? If the client wants more, the client has to call subscribe again?

Maybe I should just study the code. For the moment I'm stubbing out the queue to get my downloading process working.

timgit commented 8 years ago

Not quite. For example, review https://github.com/timgit/pg-boss/blob/master/test/speedTest.js. It has 1 subscribe, but its callback is called 1000 times.

jtlapp commented 8 years ago

Yeah, that's what I assumed. I'm finding your words taking me all over the map. I fear there is no choice at this point but for me to study the code to figure out how to use the API. Thanks for your continued effort to break through though!

timgit commented 8 years ago

You should use the tests for now in regards to sample code. I just don't have enough time to write a lot of samples yet. :)

jtlapp commented 8 years ago

It may also just reveal itself through reverse engineering, as I start to use the API. I'll have a similar test. If I can't figure it out, hope you don't mind my asking you how to modify the test to reduce the rate at which tasks deliver to the subscriber for a period of time. I'll post my module on github.

timgit commented 8 years ago

If you want to look elsewhere for inspiration or examples, I originally modeled my API after kue, a Redis-backed queue, but really any queue package would work the same way. With the exception of adding throttling support, I don't think I implemented anything original that hasn't been done before. There are probably hundreds of queue packages across various package management communities.

jtlapp commented 8 years ago

Fantastic! Thank you. I'll check kue out.

jtlapp commented 8 years ago

I'll close out all these issues and maybe later come back with something more intelligent, once I understand how things work. Thanks for your help!

timgit / pg-boss

Throttling the publisher is problematic #4