timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License
2.13k stars 158 forks

Update Documentation to clearly state it is not using postgres' pub/sub mechanisms. #93

Closed P-Seebauer closed 5 years ago

P-Seebauer commented 6 years ago

This module is very nicely written and has a nice API to it, too.

But when reading it, one could get the impression that this queue's implementation uses the publish/subscribe mechanisms built into Postgres.

Since this is really just polling-based worker distribution, imo that should be stated more clearly somehow.

timgit commented 6 years ago

Thanks! The readme does feature SKIP LOCKED prominently. I think it's clear what tech it's based on.

P-Seebauer commented 6 years ago

Do you think it would be feasible, and of interest to you, to get it working as a non-polling scheduler? This module really looks interesting and I'd be willing to spend some time on it, too.

timgit commented 6 years ago

Yes, I'm interested in how you envision implementing this. Currently, pg-boss relies on a pull architecture to handle worker communication failure, and also supports what some would describe as an IoT pattern, where there are thousands of workers on dedicated queues. My current understanding of LISTEN/NOTIFY (L-N) is that it registers on a dedicated connection identified by a session pid.

I haven't built anything on top of this feature yet, so I'm mostly ignorant of the tradeoffs. I've searched a bit, and I've found some information, but nothing in specific detail. For example, this issue on the Queue-classic package.

Ultimately, my biggest concern with this initiative would be something along the lines of the law of diminishing returns. If we're able to successfully pull this off, the net benefit architecturally is the reduction of fetching queries and, consequently, of connections to the database server. In busy use cases, where a system produces a steady stream of jobs all day, every day, there would be no benefit at all, as an L-N implementation would likely not remove the requirement of fetching a job to inspect its JSON payload.

I'm interested in your thoughts, as well as on the list below. What has been your experience with L-N, and how misinformed am I? :grin: Also, feel free to add to the list, or clarify what your MVP would be for a prototype. Thanks again!

Ideas to ponder

  1. How will it scale out as the connection count increases, particularly in use cases where connection poolers such as PgBouncer are in use?
  2. Would NOTIFY be issued as an insert trigger for new jobs? If so, how should we reduce the size of the internal notification queue during batch job insertion or COPY bulk loads? It must be able to handle millions of jobs being created in a very short time interval.
  3. How will an instance be able to check for work if a NOTIFY was missed because of temporary communication failure?
  4. If you have 5 instances sharing the work, how would we distribute the load across them? For example, if we were to iterate over a set of listeners and NOTIFY all of them one at a time by session pid, would this cause the first connection in line to receive more work than the second instance? If that's the case, we'd need to perhaps build our own load balancing abstraction of the listeners.
  5. How could we implement backpressure for an overloaded instance? It may require an instance detecting how busy it is, then unregistering its LISTEN, perhaps?
  6. What's the potential of a conflict if pg-boss is used in the same database as another L-N system?
P-Seebauer commented 6 years ago

Sorry, a project hit harder than I expected. I'll have a look. After thinking a bit, it's probably not possible to work completely without polling (at least at first sight), but it could reduce the polling greatly. On the ideas to ponder:

  1. Whenever I used LISTEN I instantiated my own pg.Client that lived outside of the pool (to have a connection meant for listening only). I did that mostly out of habit (from Redis, where this is obligatory; I found nothing like it in the pg docs). So you'd need one extra connection, which shouldn't be that big of a deal.
  2. Yes, I thought about an insert/update trigger (possibly debounced) that does basically the same query as the one happening in the poll. I'm not sure about loads that big, to be honest; I've only worked with inserts in the thousands with NOTIFY so far, so that might be a deal breaker.
  3. That's the point that's probably hardest to get right. There would have to be a "sanity check", but maybe not every second.
  4. See 2: they would all do the same thing that's now happening on the interval, but only when there's actually data there (so load balancing would be the same as it is now).
  5. Sounds fair to me; it's also possible without unregistering: as long as an instance is merely notified, it could simply not react to the notification if it "thinks" it can't handle one more job.
  6. Have not found much documentation about that. My guess would be that channels are database-scoped, so maybe don't call the channel "notifications", and make it a configurable option.
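The "possibly debounced" trigger in point 2 matters because a batch insert of N rows would fire N notifications, while the worker only needs one wake-up per burst. A minimal client-side debounce, assuming nothing about pg-boss itself (plain JavaScript, illustrative names only):

```javascript
// Collapse a burst of notifications into a single delayed wake-up call.
// A batch INSERT firing thousands of NOTIFYs then triggers one scan.
function debounce(fn, waitMs) {
  let timer = null;
  return () => {
    clearTimeout(timer);           // reset the window on every notification
    timer = setTimeout(fn, waitMs); // fire once, waitMs after the last one
  };
}
```

Note this trades a little latency (up to `waitMs` after the last insert) for far fewer redundant fetch queries; debouncing inside Postgres itself, at the trigger level, is harder.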
timgit commented 6 years ago

Since I think your concern is primarily in regards to overloading postgres with polling, can you share what load you're expecting to have?

Before you bite off the effort this architecture would require, I'd recommend spending some time with the configuration options provided around polling intervals. If those don't give you enough control, you can always customize polling completely by skipping the subscribe() apis entirely and rolling your own monitoring (the "sanity check" you mentioned) using fetch() and complete().
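The "roll your own" loop around fetch() and complete() could be sketched like this. To stay self-contained, `boss` here is any object exposing those two methods (a stub stands in below); the exact pg-boss signatures may differ by version, so treat this as an assumption-laden outline rather than the library's documented API.

```javascript
// Drain one queue to empty using only fetch() and complete().
// `boss` is anything with fetch(queue) -> job|null and complete(id).
async function drainOnce(boss, queue, handler) {
  let processed = 0;
  for (;;) {
    const job = await boss.fetch(queue); // assumed to return null when empty
    if (!job) return processed;
    await handler(job.data);
    await boss.complete(job.id);
    processed++;
  }
}
```

Calling `drainOnce` from your own timer (or from a notification handler) gives you full control over when database traffic happens, instead of leaving it to subscribe()'s interval.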

I keep enhancing subscribe() to make it friendlier to database polling traffic, so there's probably more room there for advanced use cases like auto-scaling. For example, using the monitor-states event, you could monitor the queue sizes, then dynamically spin up subscriptions. Once the volume decreases again, you could unsubscribe and reclaim the polling traffic.
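The auto-scaling decision itself can be kept as pure logic, separate from the event wiring. A sketch under stated assumptions: the backlog count comes from the monitor-states event, but the thresholds, defaults, and the `desiredWorkers` name are all invented for illustration, not part of pg-boss.

```javascript
// Given a backlog of created (unfetched) jobs, decide how many
// subscriptions to run. Thresholds here are illustrative defaults.
function desiredWorkers(createdCount, { min = 1, max = 10, jobsPerWorker = 100 } = {}) {
  const wanted = Math.ceil(createdCount / jobsPerWorker);
  return Math.min(max, Math.max(min, wanted)); // clamp to [min, max]
}
```

You would call this from a monitor-states handler and subscribe or unsubscribe to close the gap between the current and desired worker count.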

P-Seebauer commented 6 years ago

My issue was basically twofold: I have some scheduled jobs that run very rarely (like "import all the data") which trigger several jobs ("import dataset a") of their own that should run in direct response, and which may or may not trigger further jobs. My concern is that when there is a poll/waiting time for the subsequent jobs, I'll run into overlaps just because of the waiting time (i.e. when a job is pushed into the queue, I want the system to be informed not by a Node timeout, but by the system itself). It's not really an architectural requirement right now, because my dataset is actually small.

timgit commented 6 years ago

What do you mean by overlap?

P-Seebauer commented 6 years ago

I mean that when you start one job and its subjobs, the subjobs may not be finished by the time the next job starts (which could lead to clogging).

timgit commented 6 years ago

Thanks for the clarification. pg-boss doesn’t have this concept of sub jobs that you’re describing. It sounds like something along the lines of a saga. These are solvable problems, but I’m not sure the solution is something that pg-boss should be responsible for. I have several sagas in my app where I monitor long-running processes and also use cases where I have a pipeline of jobs that work together to finish a task.

Let me know if I’m misunderstanding you.

yammine commented 5 years ago

So a neat way I've seen Postgres LISTEN/NOTIFY used in another, very similar job processing library is simply to supplement polling as a strategy for pulling jobs:

https://github.com/mbuhot/ecto_job/blob/master/lib/ecto_job/producer.ex#L156-L158

Essentially the subscriber polls on an interval, but also just immediately scans for jobs upon receipt of a notification of created record(s) on the jobs table.

I guess this would lead to a more consistent end-to-end processing latency when the job creation cadence is all over the place.