timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License

v10 #425

Closed timgit closed 3 months ago

StarpTech commented 1 year ago

@timgit are you open to feedback? You resolved all my comments without an answer.

KristjanTammekivi commented 8 months ago

Hi, is there any progress here? I would like to see graceful shutdown implemented (#421).

timgit commented 8 months ago

Apologies for the long turnaround time on this release. There are a few problems I'm still looking at.

xgenvn commented 7 months ago

Really looking forward to the new version. Is there anything we can do to help make this happen? Maybe testing and documentation?

sneljo1 commented 6 months ago

Which items are still open for this? Anything major?

coveralls commented 4 months ago

coverage: 98.113% (-1.9%) from 100.0% when pulling 08c8d4b84a6adf7888b77adc2935eb6839a937b7 on v10 into f1c1636caf9518ec9cd3fe0d0e2844ee98179e68 on master.

janmeier commented 2 months ago

@timgit Can you put a couple of words on the motivation behind this major version? Which issues does this solve?

I'm saying this with the utmost understanding of what it means to release something as open source: this is your project, and I don't expect you to justify every decision. Still, I'm curious about the reasoning, and I'd like to hear about any improvements that should cause us to consider upgrading :)

timgit commented 2 months ago

The primary motivation is stability and scalability. In my experience and that of others, once a single queue exceeded 1-2 million created jobs in a backlog state, Postgres could no longer quickly locate the next record to lock via SKIP LOCKED. This is mentioned as a limitation in the article linked in the readme. The result was high memory and CPU utilization on the server, which would then impact all other databases. The problem with pg-boss v9's design is that once any queue reached this failure state, all queues, including internal maintenance queues, were affected, since all jobs shared a single table. The design goal was to mitigate a problematic queue by isolating its storage into a dedicated table via partitioning. This doesn't necessarily remove the SKIP LOCKED limitation, but in theory it should make the failure less catastrophic if and when it occurs in the future.
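To make the idea concrete, here is a rough sketch of what a SKIP LOCKED fetch and per-queue list partitioning can look like, written as SQL strings built from TypeScript. The table name `job`, its columns, and the helpers `partitionSql` and `assertSafeName` are all assumptions made for illustration; the actual pg-boss v10 schema and queries may well differ.

```typescript
// Illustrative sketch only: table, column, and function names are assumptions,
// not pg-boss's actual v10 schema.

// Allow only simple identifiers so the queue name can be embedded in DDL safely.
function assertSafeName(queue: string): void {
  if (!/^[a-z_][a-z0-9_]*$/.test(queue)) {
    throw new Error(`unsafe queue name: ${queue}`);
  }
}

// Parent table partitioned by queue name (run once).
// gen_random_uuid() is built in from PostgreSQL 13 onward.
const parentTableSql = `
CREATE TABLE job (
  id uuid NOT NULL DEFAULT gen_random_uuid(),
  name text NOT NULL,
  state text NOT NULL DEFAULT 'created',
  created_on timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (name, id)
) PARTITION BY LIST (name);`;

// One partition per queue, so a large backlog in one queue is stored
// separately from every other queue's rows.
function partitionSql(queue: string): string {
  assertSafeName(queue);
  return `CREATE TABLE job_${queue} PARTITION OF job FOR VALUES IN ('${queue}');`;
}

// Claim up to $2 jobs from queue $1. Rows already locked by another
// worker are skipped instead of blocking this one.
const fetchSql = `
WITH next AS (
  SELECT id FROM job
  WHERE name = $1 AND state = 'created'
  ORDER BY created_on
  LIMIT $2
  FOR UPDATE SKIP LOCKED
)
UPDATE job SET state = 'active'
FROM next
WHERE job.name = $1 AND job.id = next.id
RETURNING job.id;`;
```

Because both the inner SELECT and the UPDATE filter on `name`, the planner can prune to a single partition, so a queue with millions of backlog rows should mostly slow down its own fetches rather than everyone's.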

Over the years, several issues were opened around queue concurrency. Addressing them added several functions and configuration options (sendSingleton() and others) that made the API more complicated and harder to understand. I had several ideas I wanted to prototype, and they resulted in queue policies, the biggest feature enhancement worth upgrading for.

After a couple of failed attempts at building a reasonable migration plan to the new table structure, I decided it was best to release v10 without a schema migration. This wasn't my first choice, but I think it was still the cleanest way to keep the transition between these versions predictable.