timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss

High Database CPU when processing backlog #418

Closed: fenos closed this 3 months ago

fenos commented 1 year ago

Hello, firstly let me take a second to give my appreciation to pg-boss, great library!

We are noticing an increase in CPU consumption on the database when a backlog of jobs builds up. The chart looks like this:

[chart: database CPU utilization rising as the backlog builds up]

From our investigation, it seems that the workers pulling the jobs put too much strain on the database CPU when processing can't keep up.

You can see from the graph below that CPU was quite stable; however, during the rise a producer started queuing many more messages. The enqueue operation itself doesn't seem correlated with CPU usage. The subsequent drop in CPU was caused by canceling all the jobs via a SQL query.

[chart: CPU stable, then rising as the producer queues more jobs, then dropping after the jobs were cancelled]

The workers are set up as follows:

    new PgBoss({
      connectionString: url,
      max: 4,
      application_name: 'pgboss',
      deleteAfterDays: 7,
      archiveCompletedAfterSeconds: 14_400,
      retentionDays: 7,
      retryBackoff: true,
      retryLimit: 20,
      expireInHours: 48,
      monitorStateIntervalSeconds: 30,
      newJobCheckInterval: 700,
    })

In the job handler, we use async/await to build up back pressure. To clarify, the handler doesn't issue any SQL statements against this database, so the load is purely from pg-boss.

We are currently running 1 worker in 4 separate containers, so each container should pull 1 message at a time.
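
For reference, each container registers its worker roughly like the sketch below (the queue name and handler are placeholders, not our real code, and this follows the v9 `work()` API as I understand it); with `teamSize`/`teamConcurrency` left at their default of 1, the next job shouldn't be fetched until the awaited handler resolves:

    const PgBoss = require('pg-boss');

    // Placeholder for the real message processing (it issues no SQL against this database).
    async function processMessage(data) {
      // e.g. call an external service here
    }

    async function main(url) {
      const boss = new PgBoss({ connectionString: url /* ...same options as above */ });
      await boss.start();

      // One worker per container; with the default teamSize/teamConcurrency of 1,
      // pg-boss should not fetch the next job until this async handler resolves,
      // which is how we build back pressure.
      await boss.work('my-queue', async (job) => {
        await processMessage(job.data);
      });
    }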

I'm unsure why pg-boss would use progressively more database CPU even though concurrency is set to 1 and the next message shouldn't be fetched until the previous one completes.

Currently, we are OK with processing these messages more slowly than they are produced; however, we can't justify the CPU increase.

Do you know what might cause the CPU to spike up internally in pg-boss?

timgit commented 1 year ago

How many records were created in the job table when utilization was high?

fenos commented 1 year ago

Hi @timgit, recently the new producer has been constantly sending approximately 30k jobs per hour, but the current workers cannot keep up and are only able to process half of that per hour. In total, at 80% CPU, there were approximately 1M jobs: ~500k completed and ~500k created.

If I scale up the number of workers to consume more messages, the RDS CPU is consumed much faster (which makes sense), but with this workload it is depleted immediately if I double the containers from 4 to 8.

Currently, we have a cron job that cancels all the jobs when the CPU gets too high, and we replay them at a later date.

Once the jobs get canceled the CPU drops significantly.
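
For completeness, the cron job runs a bulk-cancel statement along these lines (a sketch, not the exact script; it assumes the default pgboss schema, and the queue name and states you target may differ):

    // Sketch of the bulk cancel performed by the cron job (illustrative names only).
    // Assumes the default "pgboss" schema; adjust the queue name and the states
    // you want to clear for your own setup.
    const { Client } = require('pg');

    async function cancelBacklog(connectionString) {
      const client = new Client({ connectionString });
      await client.connect();
      try {
        const res = await client.query(
          `UPDATE pgboss.job
              SET state = 'cancelled', completedOn = now()
            WHERE name = $1
              AND state IN ('created', 'retry')`,
          ['my-queue'] // placeholder queue name
        );
        console.log(`cancelled ${res.rowCount} jobs`);
      } finally {
        await client.end();
      }
    }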

We are running on a db.t4g.small RDS instance and we plan to upgrade to a bigger instance to keep up with the job processing, but we're wondering whether this issue might come back even with a bigger instance.

fenos commented 1 year ago

@timgit After further investigation, it seems that the issue is caused by the sorting in the nextJob query: https://github.com/timgit/pg-boss/blob/master/src/plans.js#L379

Since there is no compound index on these columns, once the table gets big Postgres starts doing an on-disk merge sort, which causes high CPU usage as the table grows.

I'm wondering if adding an index on the sorted columns could help speed things up and reduce the load on Postgres when sorting.
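
To make the suggestion concrete, this is the kind of experiment I have in mind: a partial index covering the fetch query's filter and sort columns. The column list below assumes the v9 schema and an ORDER BY of priority DESC, createdOn, id, so it should be verified against plans.js for the installed version before relying on it:

    // Sketch: add a candidate index for the fetch query and compare plans with
    // EXPLAIN ANALYZE. Assumes the default "pgboss" schema and the v9 job table;
    // double-check the sort columns (priority DESC, createdOn, id) against plans.js.
    const { Client } = require('pg');

    async function addCandidateFetchIndex(connectionString) {
      const client = new Client({ connectionString });
      await client.connect();
      try {
        // CONCURRENTLY avoids blocking job inserts/updates while the index builds.
        await client.query(`
          CREATE INDEX CONCURRENTLY IF NOT EXISTS job_fetch_experiment
            ON pgboss.job (name text_pattern_ops, priority DESC, createdOn, id)
            WHERE state < 'active'
        `);
      } finally {
        await client.end();
      }
    }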

fenos commented 1 year ago

@timgit We have upgraded the RDS instance to a db.t4g.xlarge and, unfortunately, this issue is still happening. Is there anything we can do, maybe, as mentioned above, adding an index on the sort columns to reduce the on-disk sorting?

timgit commented 1 year ago

I have tried several times to build the best index for job fetching, but apparently this work will be ongoing. 😄 If you're up for it, can you experiment with your job table to find out what index best helps your use case?

DonKoko commented 9 months ago

Hey @fenos, did you manage to find a solution for this? We are kind of stuck in this situation and our DB CPU usage keeps increasing.

fenos commented 9 months ago

Yes, the "solution" is to have the least active jobs as possible as well as setting:

sledorze commented 8 months ago

@fenos Did you try adding an index? If so, what was it, and what was the result?

chrisbujok commented 6 months ago

I'm experiencing the same issue. CPU usage is periodically growing and growing. I have only one worker with no concurrency, very low traffic (one job every N minutes), and a job execution time of roughly a couple of seconds. What I noticed, however, is that after restarting the worker, CPU usage dropped to almost 0%. It then grows back toward ~75%.

timgit commented 4 months ago

In v9, all jobs, including internal maintenance jobs, share the same storage. In our experience, once we have > 1mil jobs, the performance of SKIP LOCKED starts to suffer. v10 introduced job storage partitioning, which should isolate perf issues to only the queue in question. This strategy is still being evaluated and not yet in production, but you are welcome to try out a beta release if you would like an early preview.
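
For anyone following along, this is the general shape of a SKIP LOCKED fetch (a generic illustration, not pg-boss's exact SQL): every poll has to filter and sort candidate rows in the shared job table before it can skip the locked ones, so the per-fetch cost grows with the size of the whole table, which is the cost per-queue partitioning aims to contain:

    // Generic illustration of the SKIP LOCKED fetch pattern (not pg-boss's actual
    // query). Each poll scans and sorts eligible rows in the shared table, skipping
    // rows other workers hold locks on, so cost tracks the size of the whole table.
    const { Client } = require('pg');

    async function fetchOne(client, queue) {
      const { rows } = await client.query(
        `WITH next AS (
           SELECT id
             FROM pgboss.job
            WHERE name = $1
              AND state < 'active'
              AND startAfter < now()
            ORDER BY priority DESC, createdOn, id
            LIMIT 1
            FOR UPDATE SKIP LOCKED
         )
         UPDATE pgboss.job j
            SET state = 'active', startedOn = now()
           FROM next
          WHERE j.id = next.id
        RETURNING j.id, j.name, j.data`,
        [queue]
      );
      return rows[0] ?? null;
    }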

timgit commented 3 months ago

Please re-open if this continues to happen in v10