timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License

Would like to see a big picture overview #3

Closed jtlapp closed 8 years ago

jtlapp commented 8 years ago

I've never used a transactional queue before and find myself having to infer the overall design from the API and configuration docs. I'm not sure I'm getting it right. It would be nice for the docs to include a big picture overview.

It seems that you implement a single queue of virtual queues, each virtual queue corresponding to a job name. Each job consumer subscribes to a virtual queue by job name.

When a job consumer receives a job, it processes the job and then calls a queue-provided callback to report job completion. The job may then be archived, meaning that it is removed from the virtual job queues and yet somehow also still available.

Questions:

(1) How does a job consumer report failure, such as to requeue the job for another attempt later?

(2) Can failed jobs be requeued with a delay not indicated when the job was originally queued? (I'll be downloading web pages, may want to wait before retrying.)

(3) Can I process jobs statefully? In my case, I need to queue a job to load an image from the web. Once successfully loaded, I then need to queue the job for conversion to a thumbnail. Can I just change the job name to put the job in a new queue? Or is this a dequeuing of the first job and a queuing of a second job in a new virtual queue?

(4) I'm not seeing an option not to archive jobs. Is archival somehow necessary? My project doesn't seem to need archival, because each job results in a downstream database record.

(5) How do I access archived jobs? How do I get rid of old ones? Would I need to code to your schema, or do you somehow provide access?

It would help to not only have answers to these particular questions, but also clarification of the big picture that such questions suggest I'm missing. Thank you! Looking forward to using your module...

timgit commented 8 years ago

Hi there! There's quite a bit here to respond to, so please excuse me ahead of time if I missed something in your questions or if I'm oversimplifying something.

First of all, in terms of an architectural overview, it's a task queue for processing long-running operations asynchronously. Your mention of "queue of queues" and "virtual queue" is focusing too much on the internals of how it works. Basically, you could do all you want without a task queue, but I'm assuming you are looking into a task queue because you want to process your workload asynchronously. I know very little of what you're trying to do, so I can't really tell you if a task queue is what you need.

I can give you an example of what I might use it for if it helps. I have a web API that allows users to request large zip files to be created. Since this will take a while, I don't want to block the request but instead acknowledge that the request was received and work on it later. The user can then check on the progress using other API endpoints that I would provide, for example. The benefit of persisting the request to a database is that it survives the server going down or some other major failure of the application. Otherwise, you could hold onto the request in memory and not bother with a task queue.
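
To make the zip example concrete, here is a rough sketch of the "acknowledge now, work later" flow. Everything in it is hypothetical: `requestZip`, `checkProgress`, and `workOnce` are made-up names, and the queue is a plain in-memory array, which is exactly the variant that loses requests if the process dies. A persisted task queue would take the place of that array.

```javascript
// Stand-in queue and status store (hypothetical, in memory, not pg-boss).
const pending = [];
const status = new Map();
let nextId = 1;

// Request handler: record the request and acknowledge immediately.
function requestZip(files) {
  const id = String(nextId++);
  status.set(id, 'queued');
  pending.push({ id, files });
  return { accepted: true, id };
}

// Progress endpoint: report what is known about a request.
function checkProgress(id) {
  return status.get(id) || 'unknown';
}

// Background worker: takes one queued request and does the slow part.
function workOnce() {
  const job = pending.shift();
  if (!job) return false;
  status.set(job.id, 'working');
  // ... building the zip from job.files would happen here ...
  status.set(job.id, 'done');
  return true;
}

const ack = requestZip(['a.txt', 'b.txt']);
const before = checkProgress(ack.id); // status right after the acknowledgement
workOnce();                           // later, in the background
const after = checkProgress(ack.id);  // status once the worker has finished
console.log(ack, before, after);
```

The caller gets its answer from `requestZip` before any slow work happens, and only the status store connects the two halves.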

In terms of failure, pg-boss emits an error event that you can listen to, but it's really for "something went really wrong with the job subsystem" type of errors (for example, a problem with your database). If you're referring to the code you add in your listener, you can be as defensive as you'd like since it's just calling the function you passed.

For state, it's all up to you for orchestration. If a job needs to be started after another is finished, just start that job from the completion of the first one.

The task queue itself is sort of like a log of requests made to a system in the same way that a web server may log requests made to an API. This can fill up your database after a while, so it's a good practice to get rid of them periodically. There's a configuration setting for how often you want this to run if you'd like to extend it.

Hope this helps!

jtlapp commented 8 years ago

Thanks for responding so quickly!

Okay, I guess I should first show that I need a transactional queue. I'll have users making requests of my server to index web pages of their selection. I need to queue tasks to download these web pages, because I won't be doing it during the user's HTTP request. Once a page downloads, I'll be examining it for certain image URLs. I then need to queue tasks requesting that these images be downloaded. Once the images are downloaded, I then need to queue tasks asking that these images be resized to thumbnails. All this happens in the background. I'll also offer ways to present progress back to the user. As I scale, I'll be dedicating servers to the job. It's in support of http://instarearth.com/approach.html

It sounds like you're saying that my "virtual queue" interpretation is misguided. How should I be interpreting job names? What else might they be for?

My question about failing a job regards this description of subscribe:

"handler will have 2 args, job and callback. The job object will have id, name and data properties, and the callback function should be used to mark the job as completed in the database. If you forget to use the callback to mark the job as completed, it will expire after the configured expiration period. The default expiration can be found in the configuration docs."

The description tells me the callback is for reporting successful completion. The only other option it mentions is failing to call the callback, in which case the job will eventually expire. How do I handle a situation in which the subscriber fails to do the job but wants it retried later? Let's say the job is to download a web page and the page hangs. Does the failing subscriber first queue a new job duplicating the present job, and then call the callback to dequeue the old job? I wouldn't want to call your callback first, because I need to know that the job got requeued -- unless you have a way for me to call the callback saying job failed, please requeue.

The above text also suggests that once the subscriber is called, either the subscriber handles it or the job eventually expires. What if the process crashes after you call the subscriber and before the subscriber calls the callback? Seems like that's partly what a transactional queuing system is for.

I'm not sure I understood your state response. It sounds like you're saying that I cannot assign job state -- a job is either waiting to be done or it's done. If I want to move a job to a new state, I need to queue a new job in the new state and dequeue the old job.

Regarding the job archive, you said, "There's a configuration setting for how often you want this to run if you'd like to extend it." I'm not sure I see it. The expiration options seem to apply to jobs that don't complete. You have a bunch of job archive options. They all baffle me.

archiveCompletedJobsEvery: "When jobs become eligible for archive after completion." This option suggests that jobs have a limbo state between when they complete and when they get archived. I don't understand why you would wait to archive.

There are also archive check intervals indicating how often jobs are archived. I assume that this kicks in after the archiveCompletedJobsEvery delay? This looks like another delay between job completion and archiving (job limbo period), but not a measure of when to clear out the archive.

Sorry to mostly ask my questions again. Maybe we'll get there by refinement.

jtlapp commented 8 years ago

Update: Maybe I'm processing more of your words. It sounds like the archive might be a log file rather than a queryable store of jobs. In that case, I can externally manage the log. And in that case, it makes sense that you would only write to it in batch. If I'm on board, ignore my response about archives above.

(By way of explanation, I've been arguing with the owner of a module that does queuing and keeps all old jobs around. I've been trying to explain that I won't be keeping them around and that most transaction queuing systems wouldn't either. I guess I was confusing your archiving with his.)

jtlapp commented 8 years ago

I think the only events I'd want logged are unhandled errors. In my case, logging anything else seems like unnecessary use of resources. Successfully handled jobs will be reflected elsewhere in the database. So I'd rather be able to turn off logging, or at least logging of everything but errors.

timgit commented 8 years ago

The jobs that get archived just live in a table called job, in a new schema that is created in your database. Waiting to archive a job is merely for support/review purposes. In my experience it's useful to see at least a day's worth of job history for reporting or monitoring purposes if needed. You can throttle down the interval if you want to just discard them. That's why I created it as a configuration setting. :) Some may want to keep job history around for a bit, but others may not. It's really a case by case determination.

In regards to your state question, I pass job data payloads between multiple jobs for orchestration purposes. It's hard to explain here in the verbosity of a typical conversation, but you can chain jobs together if needed. You can just call boss.publish('someNewJob', payload) from a previous boss.subscribe handler. You would just call done() after you publish a new job.
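
Roughly, the chaining could look like the sketch below. To keep it runnable on its own, `createFakeBoss` is a hypothetical in-memory stand-in rather than pg-boss; it only mimics the `publish(name, data)` and `subscribe(name, handler)` shape discussed in this thread, and the job names and payload fields are made up.

```javascript
// Stand-in queue (hypothetical, not pg-boss): jobs are held in an array and
// delivered synchronously, in order, by run().
function createFakeBoss() {
  const handlers = new Map(); // job name -> handler(job, done)
  const pending = [];
  const log = [];
  return {
    log,
    subscribe(name, handler) { handlers.set(name, handler); },
    publish(name, data) { pending.push({ name, data }); },
    run() {
      while (pending.length > 0) {
        const job = pending.shift();
        const handler = handlers.get(job.name);
        if (handler) handler(job, () => log.push(`done:${job.name}`));
      }
    },
  };
}

const boss = createFakeBoss();

// First stage: "load" the image, then hand the payload to the next stage.
boss.subscribe('load-image', (job, done) => {
  const payload = { url: job.data.url, bytes: 1024 }; // pretend download result
  boss.publish('make-thumbnail', payload); // start the follow-up job
  done();                                  // then mark this job completed
});

// Second stage: receives the payload published by the first stage.
boss.subscribe('make-thumbnail', (job, done) => {
  boss.log.push(`thumbnail:${job.data.url}`);
  done();
});

boss.publish('load-image', { url: 'http://example.com/a.png' });
boss.run();
console.log(boss.log);
```

Publishing the follow-up job before calling done() means a crash in between leaves the first job incomplete rather than silently dropping the second stage.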

If you're concerned about a timeout error, you would just write that logic into your handler and make sure it's within the expiration setting of the job.

jtlapp commented 8 years ago

Okay, thanks. Still missing one piece. What is the job name for? Presumably it's not a unique ID. Seems like it should represent a sub-queue or virtual queue of that name.

It sounds like archiveCompletedJobsEvery says when to transfer jobs to the job table, and the "check interval" is the duration for which they live in this table. After that, they are deleted. If this is how it works, the doc isn't clear about this.

I still would like clarification on my timeout issue. It sounds like if a subscriber gets a timeout (within the job expiration period), and if the subscriber wants to postpone the retry, the subscriber first posts a new job for the next try and then calls your done callback to dequeue the prior job.
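
If that reading is right, the pattern might look something like this sketch. It runs against a hypothetical in-memory stand-in with a simulated clock, not pg-boss, and the `startIn` delay option (in seconds), the `attempt` field, and the 60-second wait are all assumptions made for illustration.

```javascript
// Stand-in queue with a simulated clock (hypothetical, not pg-boss).
// The startIn option (seconds of delay) is an assumption for illustration.
function createFakeBoss() {
  const handlers = new Map();
  const pending = [];
  let now = 0; // simulated seconds
  return {
    subscribe(name, handler) { handlers.set(name, handler); },
    publish(name, data, options = {}) {
      pending.push({ name, data, startAfter: now + (options.startIn || 0) });
    },
    // Advance the clock and deliver every job that has become due.
    tick(seconds) {
      now += seconds;
      const due = pending.filter((job) => job.startAfter <= now);
      for (const job of due) {
        pending.splice(pending.indexOf(job), 1);
        const handler = handlers.get(job.name);
        if (handler) handler(job, () => {});
      }
    },
    pendingCount() { return pending.length; },
  };
}

const boss = createFakeBoss();
const attempts = [];

boss.subscribe('download-page', (job, done) => {
  const attempt = job.data.attempt || 1;
  attempts.push(attempt);
  const timedOut = attempt < 3; // pretend the first two tries hang
  if (timedOut) {
    // Queue the next try first, with a delay, so the retry is known to exist.
    boss.publish('download-page', { ...job.data, attempt: attempt + 1 }, { startIn: 60 });
  }
  done(); // then retire the current job
});

boss.publish('download-page', { url: 'http://example.com/' });
boss.tick(0);  // t=0: attempt 1 runs, retry scheduled for t=60
boss.tick(30); // t=30: nothing is due yet
boss.tick(30); // t=60: attempt 2 runs, retry scheduled for t=120
boss.tick(60); // t=120: attempt 3 runs and does not requeue
console.log(attempts, boss.pendingCount());
```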

If I get this working, I suspect you'll find me offering clarifications to the docs!

jtlapp commented 8 years ago

Also, I need two processes on the same computer to access the same queue. I assume nothing prohibits me from loading pg-boss into each process for the same queue?

I also have to manage compound jobs. A job to load an HTML page expands into jobs to load images found on that page. If I want to track the progress of loading the page's images, it sounds like I'll need to create my own table of jobs keeping files/bytes loaded for the page. (Edit: Deleted business of queuing a completion job, because only one client could service the job.)

jtlapp commented 8 years ago

Does the subscribe method guarantee that only one subscriber will receive a job? Or do all subscribers receive all jobs? What if the subscribers are in different processes or on different computers (hitting the same database)?

In my modelling, I'm noticing that I have logically different queues that would all use a single pg-boss task table. This suggests to me that the "job name" parameter is the logical queue name.

jtlapp commented 8 years ago

Until I examined the code, I was a bit concerned to see the description of newJobCheckInterval and then how throttling applies to the publisher. It seems that I can also throttle by changing the newJobCheckInterval, because apparently only one job is delivered at this interval, no matter how many are pending. The code tells me this on line 19 of https://github.com/timgit/pg-boss/blob/master/src/worker.js.

So this thread is basically a long list of important information missing from the docs. In thanks for your effort developing the code, I'll see if I can fill in the missing pieces.

jtlapp commented 8 years ago

Oh no. I take that back. The interval is set and fixed at queue construction, not at publication. Throttling is only done at the publisher. I'm going to create another issue for this.

jtlapp commented 8 years ago

According to https://github.com/timgit/pg-boss/blob/master/test/retryTest.js, a retry occurs after each job expiration. Would be nice to add that to the docs too.
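
As I understand it, the behavior amounts to something like the toy model below. This is not pg-boss's implementation; `createToyQueue`, the state names, and the `expireIn` and `retryLimit` values are illustrative assumptions.

```javascript
// Toy model of "retry after expiration" (not pg-boss's implementation).
// expireIn and retryLimit are illustrative names and values.
function createToyQueue({ expireIn, retryLimit }) {
  const jobs = [];
  let now = 0; // simulated seconds
  return {
    publish(data) {
      jobs.push({ data, state: 'created', retries: 0, startedAt: null });
    },
    // Hand out the next available job and start its expiration timer.
    fetch() {
      const job = jobs.find((j) => j.state === 'created');
      if (!job) return null;
      job.state = 'active';
      job.startedAt = now;
      return job;
    },
    complete(job) { job.state = 'completed'; },
    // Advance time; active jobs past their deadline are retried or expired.
    advance(seconds) {
      now += seconds;
      for (const job of jobs) {
        if (job.state !== 'active' || now - job.startedAt < expireIn) continue;
        if (job.retries < retryLimit) {
          job.retries += 1;
          job.state = 'created'; // available to subscribers again
        } else {
          job.state = 'expired'; // out of retries: terminal
        }
      }
    },
  };
}

const queue = createToyQueue({ expireIn: 10, retryLimit: 1 });
queue.publish({ url: 'http://example.com/' });

const first = queue.fetch();  // subscriber takes the job, then hangs
queue.advance(10);            // deadline passes without complete()
const second = queue.fetch(); // the same job is delivered again
queue.advance(10);            // hangs again, no retries left
const third = queue.fetch();  // nothing left to deliver

console.log(first === second, second.retries, second.state, third);
```

In this model only the last state is terminal, which is why "timeout" might describe the intermediate step better than "expiration".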

An "expiration" is usually terminal. Maybe this should be called a "service timeout", "retry timeout", or just a "timeout."

timgit commented 8 years ago

Joe, hi. :) Good job on reading the tests and code. The tests are always more reliable than the docs. Did that answer the bulk of your questions?

jtlapp commented 8 years ago

I'm adding to my docs to-do list: say something that prevents the confusion expressed in #4, #5, and #6.

jtlapp commented 8 years ago

I'm still planning to get to this and offer you doc clarifications. I've been programming since 1982, so if I'm needing more information, surely others are too. Just got sidetracked...