Closed bhsiaoNFLX closed 3 years ago
It seems like pg-boss isn't cluster-based out of the box (from a DB perspective), so maybe the answer is as simple as "just don't hammer the DB that much," but I wanted to double-check regardless, just in case.
There are a lot of factors to consider here, but here are some general guidelines I use for our largest instances.

Use batchSize or teamSize, which both result in fetching batches of jobs in the same fetch interval. This is very much a case-by-case basis, but it can work very well. For example, I was able to increase job throughput 50x in one of my subscriptions by using batchSize and getting an array of jobs. In my case, each job was able to complete quickly with a shared resource instead of requiring a new resource to be acquired per job.

Disable completion jobs via the onComplete config property in the constructor and/or publish(). This will make sure you have the fewest jobs in the queue based on which ones you intend to subscribe to.

In regards to Aurora, I don't think it's compatible, since AWS built their own storage layer and I would guess SKIP LOCKED is not implemented. If I'm wrong about that, it's a great thing to be wrong about. :)
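The shared-resource point can be sketched in plain Node. Everything here is a hypothetical illustration (not the pg-boss API): with one job per handler call, a naive handler acquires a resource per job, while a batch handler can acquire it once for the whole array.

```javascript
// Sketch of the shared-resource idea above. All names here are
// hypothetical illustrations, not the pg-boss API.
let acquisitions = 0;

function acquireResource() {
  acquisitions += 1; // e.g. a connection, API client, or file handle
  return { send: (data) => data };
}

// teamSize-style: the handler sees a single job, so a naive handler
// acquires the resource once per job.
function perJobHandler(job) {
  const resource = acquireResource();
  return resource.send(job.data);
}

// batchSize-style: the handler sees an array of jobs and can acquire
// the resource once for the whole batch.
function batchHandler(jobs) {
  const resource = acquireResource();
  return jobs.map((job) => resource.send(job.data));
}

const jobs = Array.from({ length: 50 }, (_, i) => ({ id: i, data: i }));

jobs.forEach(perJobHandler); // 50 acquisitions
batchHandler(jobs);          // 1 more acquisition, 51 total
```

Whether the batch version actually helps depends on whether the per-job resource acquisition is the bottleneck, which is the "case by case" caveat above.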
Thank you for the detailed response. My response to each of your points (if you don't have the time, jump to 6.)
teamSize: 20,
teamConcurrency: 20,
When you say "fetch intervals" I'm assuming you're referring to newJobCheckInterval, which we're setting to 100. Let me know if you think this seems too aggressive.

So I understand it'll be an onComplete: false that we pass in, which will prevent creation of the internal job rows (__state__completed__xx), ergo cutting table size by half, right?

Explain something to me: increasing teamConcurrency increases polling back pressure, while scaling out pg-boss workers also empties the queues faster. In other words it sounds like they achieve the same effect, or am I mistaken here? It would be great if you had an answer for this, because it's been confusing me for a while.

In regards to Aurora, I don't think it's compatible since AWS built their own storage layer and I would guess SKIP LOCKED is not implemented. If I'm wrong about that, it's a great thing to be wrong about. :)

Have you actually tried Aurora and verified pg-boss doesn't work on it? Or just speculating? Regarding SKIP LOCKED, I imagine it'd work because PostgreSQL is PostgreSQL regardless of RDS vs. Aurora. As long as I stick with Postgres 11.4 or above, shouldn't I be just fine? SKIP LOCKED is mentioned in the Postgres 11 documentation.
When you say "fetch intervals" I'm assuming you're referring to newJobCheckInterval, which we're setting to 100. Let me know if you think this seems too aggressive.
Setting the fetch interval to 100ms results in over-querying for new work for most use cases. It's there to be used in special cases, as an overridden default for specific low-latency subscriptions. Regardless, I can't complain too loudly here, since pg-boss does allow this value. Just something to be thinking about in regards to putting read pressure on your database.
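As a rough back-of-envelope (the worker count below is an assumed example, not a number from this thread), the baseline read pressure from polling alone is workers × (1000 / newJobCheckInterval) fetch queries per second:

```javascript
// Back-of-envelope read pressure from polling alone. The worker count
// is an assumed example, not a number from this thread.
const newJobCheckInterval = 100; // ms, as configured above
const workers = 20;              // hypothetical number of polling workers

const pollsPerSecond = workers * (1000 / newJobCheckInterval);
// 20 workers polling every 100 ms issue 200 fetch queries per second
// before a single job is even processed
```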
Glad to know. We're already on 5.0.8, so once 5.2.0 is out, migrating to it should be simple. So I understand it'll be an onComplete: false that we pass in, which will prevent creation of the internal job rows (__state__completed__xx), ergo cutting table size by half, right?
Yes. Released 5.2.0 earlier today btw.
Explain something to me: increasing teamConcurrency increases polling back pressure, while scaling out pg-boss workers also empties the queues faster. In other words it sounds like they achieve the same effect, or am I mistaken here?
pg-boss uses p-map for teamConcurrency, so you could exhaust Node's event loop with really high settings here on a single instance. This is something you should experiment with to determine how your job handlers behave, since it's very much a case-by-case basis. For example, I've seen CPU utilization spike with high concurrency settings in the past on some of my queues related to this.
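For intuition, here is a deliberately simplified, synchronous model of what a concurrency cap buys you; the real p-map refills slots as each promise settles rather than working in fixed chunks:

```javascript
// Deliberately simplified, synchronous model of a concurrency cap:
// never more than `concurrency` mapper calls "in flight" at once.
// The real p-map refills slots as each promise settles instead of
// working in fixed chunks.
function mapWithConcurrency(items, mapper, { concurrency }) {
  const results = [];
  let maxInFlight = 0;
  for (let i = 0; i < items.length; i += concurrency) {
    const chunk = items.slice(i, i + concurrency); // at most `concurrency` items
    maxInFlight = Math.max(maxInFlight, chunk.length);
    results.push(...chunk.map((item) => mapper(item)));
  }
  return { results, maxInFlight };
}

const { results, maxInFlight } = mapWithConcurrency(
  [1, 2, 3, 4, 5],
  (x) => x * 2,
  { concurrency: 2 }
);
// results is [2, 4, 6, 8, 10]; maxInFlight never exceeds 2
```

The event-loop warning above follows from this: with a very high cap, nothing limits how many handlers run at once on a single Node instance.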
There are actually 2 other options to consider for scaling out and concurrency.
1. Use batchSize instead of teamSize. For example, if your use case in the handler can actually convert an array of jobs to a batch operation via Array.map(), this can drastically increase throughput and reduce downstream resource utilization.
2. Create multiple subscriptions on the same queue. Essentially this is as easy as calling subscribe() twice for the same queue. What this does behind the scenes is set up 2 independent workers, each with their own backpressure loops.
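The behind-the-scenes behavior of the second option can be sketched with a stub (for illustration only; this is not pg-boss internals): each subscribe() call registers its own worker on the queue.

```javascript
// Stub for illustration only, not pg-boss internals. The point:
// each subscribe() call registers its own worker, each with its own
// polling/backpressure loop, even on the same queue.
function makeBossStub() {
  const workers = [];
  return {
    workers,
    subscribe(queue, options, handler) {
      workers.push({ queue, options, handler }); // one worker per subscribe()
    },
  };
}

const boss = makeBossStub();
const handler = (job) => job.data;

boss.subscribe('work-queue', { teamSize: 10 }, handler);
boss.subscribe('work-queue', { teamSize: 10 }, handler); // same queue, second worker

// boss.workers.length is 2: two independent workers draining one queue
```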
I actually use both of these, along with independent Node processes via docker containers, to give you more "real world" examples of combinations of hosting and techniques.
Can you provide more concrete pointers on how to set this up? Is Range Partitioning the right place to start? Any sample code and wisdom would be appreciated.
This code would create 2 pg-boss instances completely independent of each other in regards to storage. options in this example is whatever you're currently passing for config:
const boss1 = new PgBoss({ ...options, schema: 'pgboss1' })
const boss2 = new PgBoss({ ...options, schema: 'pgboss2' })
I imagine it'd work because PostgreSQL is PostgreSQL regardless of RDS vs. Aurora.
I haven't tried it, but my understanding is that Aurora is an implementation of the Postgres query layer (wire protocol), but with a completely different (and proprietary) storage engine. This is similar to other products such as CockroachDB or YugabyteDB, if you want to think of it like that. SKIP LOCKED was added to Postgres to resolve issues with queue-like record locking problems inherent in MVCC, which these other databases don't use.
Setting the fetch interval to 100ms results in over-querying for new work for most use cases.
What fetch interval would you recommend?
Yes. Released 5.2.0 earlier today btw.
woo great timing! Just upgraded today.
Use batchSize instead of teamSize.
Got it. I will recommend it to my team.
I actually use both of these, along with independent Node processes via docker containers, to give you more "real world" examples of combinations of hosting and techniques.
Nice, didn't know about technique 2. Though if I can provide some feedback, this seems like tribal knowledge, and if it's a thing for performance it should be documented or built-in. Same holds true for teamSize vs. batchSize - if one is generally better than the other performance-wise, please consider documenting it :)
const boss1 = new PgBoss({ ...options, schema: 'pgboss1' })
const boss2 = new PgBoss({ ...options, schema: 'pgboss2' })
Okay, please explain how this actually improves performance in the scenario where there's one target database and CPU utilization hovers near 100%. If there were two target databases that'd make sense, because the CPU utilization would be split between them, but you're saying separating the schema out would yield better performance, and that's the part I'm not getting.
I haven't tried it, but my understanding is that Aurora is an implementation of the Postgres query layer (wire protocol), but with a completely different (and proprietary) storage engine. This is similar to other products such as CockroachDB or YugabyteDB, if you want to think of it like that. SKIP LOCKED was added to Postgres to resolve issues with queue-like record locking problems inherent in MVCC, which these other databases don't use.
I see. That seems like bad ergonomics for Postgres. I'll have to take your word for it and revisit at a later time.
What fetch interval would you recommend?
I can't easily answer that question for you. How often would you like to check for jobs? What is a reasonable latency between creating and receiving a job? Answering these questions will help you decide. It can be customized per subscription or set as a global default.
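A hedged example of the two places this setting can go (the interval values are illustrative, not recommendations from this thread):

```javascript
// Illustrative values only, not recommendations. newJobCheckInterval
// can be set globally in the constructor options and overridden per
// subscription.
const constructorOptions = {
  // global default: poll each queue every 2 seconds
  newJobCheckInterval: 2000,
};

const subscribeOptions = {
  // low-latency queue: override to poll every 500 ms
  newJobCheckInterval: 500,
};
```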
this seems like tribal knowledge and if it's a thing for performance it should be documented or built-in
Guilty. The subscribe() docs don't mention this implementation detail. That is, "subscribe() adds a new polling worker to a collection of workers". I should add this to the docs.
Same holds true for teamSize vs. batchSize - if one is generally better than the other performance-wise please consider documenting it
teamSize is more of a convenience abstraction over batchSize; with batchSize you have to work with the job array directly, manually completing and/or failing jobs. You would need to try both to determine if the extra effort required to use batchSize is worth it.
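The shape difference can be sketched as follows (bossStub stands in for a real pg-boss instance, where you would call boss.complete() yourself when using batchSize):

```javascript
// Shape difference only; bossStub stands in for a real pg-boss
// instance, where you would call boss.complete(id) yourself when
// using batchSize.
const completed = [];
const bossStub = { complete: (id) => completed.push(id) };

// teamSize-style handler: receives one job; pg-boss completes it for
// you when the handler resolves.
function teamHandler(job) {
  return job.data;
}

// batchSize-style handler: receives the whole array and is
// responsible for completing (or failing) each job.
function batchHandler(jobs) {
  const results = jobs.map((job) => job.data);
  jobs.forEach((job) => bossStub.complete(job.id));
  return results;
}

teamHandler({ id: 'a', data: 1 });
batchHandler([{ id: 'b', data: 2 }, { id: 'c', data: 3 }]);
// completed is now ['b', 'c']
```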
please explain how this actually improves performance in the scenario where there's one target database and CPU utilization hovers near 100%
I doubt this would help if your db server is out of resources. The only point I was trying to make is that this introduces manual partitioning for all the reasons partitioning is good for a large table.
Perhaps a little bit late, but good to know for others. To scale pg-boss horizontally when your system is Postgres performance-bound, create Postgres read replicas, connect pg-boss to those, and use the master only to publish jobs. See https://github.com/bitnami/bitnami-docker-postgresql#setting-up-a-streaming-replication for a possible setup.
Wouldn't fetching a job with SKIP LOCKED (+ the subsequent update) require read-locks and be master-bound?
We're running into a scaling concern with the nextJob internal query as we're reaching 1M+ jobs and 2M+ archived jobs. Any wisdom on how to horizontally scale pg-boss DB operations with a proven track record?