Closed ericholiveira closed 1 year ago
Thanks so much for the feedback! I've also seen performance of our instances start to decrease once millions of records exist in the job table. One of the challenges of changing the id data type is native postgres support, since currently the pgcrypto module is generating uuids for direct insertion use cases.
I'm assuming by your question that these are not deferred jobs, since you didn't mention timestamps in your suggestion. This would imply all jobs are immediately available to be worked as soon as they are created. It would then seem like the goal is to prevent the table from growing this large instead of engineering a solution to handle this load. Ideally, we should be able to keep queue sizes low at all times or backlogs are created and processing latency increases.
Please answer the questions below to add more context.
When you say you have a queue, are all of the 20M jobs all in the same queue? If you had multiple queues, you could manually partition them into different schemas, with the caveat of you needing to manage multiple pg-boss instances in your code base. Partitioning is the easiest way to resolve performance in large tables in postgres.
If partitioning is not an option, is the 20M the queue backlog (created state) or are most of these completed? If you have a performance issue when most jobs are in the completed state, you can configure the archive to more aggressively move completed jobs out of the job table to reduce it's size and improve index performance.
If archive configuration doesn't help, how many instances are you running and what is your maximum concurrency? If you created 20M jobs right now, how long would it take your workers to drain the queue, ignoring the performance issue mentioned here? Do you have rate limits preventing you from increasing your worker concurrency? Are you currently using batchSize
or fetch()
to get more than 1 job per query?
Do you have a way of monitoring the queue size and dynamically scaling out/in compute capacity? This is a pattern we use in AWS with auto scale groups, when we can't predict when a large spike in jobs will occur.
Thanks for the quick response... actually they are deferred jobs, but as the query already filter by "startAfter" i just wanted to simplify the sort query to gain a little bit of performance as the other columns are not indexed....
Just to clarify the queue even is this size is working, as i dont need to keep fetching data every 2 seconds, all i had to do was to increase the fetch interval and everything is working atm... Just wanted to suggest an improvement
Answering your questions: 1 - They are all in the same queue, my initial plan was to logically shard queue (queue1, queue2...) to reduce the row scan for each one... i can also partition it in different schemas if needed... 2 - Most of them are in created state 3 - Im using teamSize and Concurrency... from a worker perspective i can handle all the jobs without any effort, my main problem was fetching this data from the db 4 - I sure do monitor queue size on datadog, but as i said from a consumer perspective theres no backpressure, the consumers could handle way more than i have
Thanks for the responses. If I understand your use case, it's basically, "there will always be millions of deferred jobs, so therefore the table size will never be reduced". In this case, your overall performance is always going to be sub-optimal, affecting even the internal maintenance jobs pg-boss uses behind the scenes. Manual partitioning is going to give you the best return on investment in my opinion. I would prefer to keep the large deferred jobs out of the primary job table until they're ready to be worked, probably using an intermediate table with a scheduled job that can monitor it and decide when to place them into the queue. There are lots of different ways to address this, but I think the focus should be on keeping the record counts as low as possible before trying something more advanced.
First of all, congratz for the amazing project... Currently we have been using it in production and in our use case we have a queue with 10-20M itens, and the fetch query sometimes take a lot of time to run... We dont need priority in our queue and we are just interested in the insertion order, so it would be amazing to support ulid as the job identifier because it is Lexicographically sortable, and also to be able to turn off the "priority" feature for some queues this way we can change the fetch query from
ORDER BY priority desc, createdOn, id
toORDER BY id
might seem as a minor change, but has better performance for big queuesAnother option would have the option to pass an id generator (so the developer can choose the best implementation for their use case) and also let turn off the fields from sort