timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License
1.95k stars 153 forks source link

PgBoss stuck with pg_advisory_xact_lock when cleaning up archive #323

Closed laurisvan closed 2 years ago

laurisvan commented 2 years ago

Our database yesterday got totally stuck due to several connections waiting for shared advisory lock to archive. Here's the statement that caused the lock:

BEGIN; SELECT pg_advisory_xact_lock( ('x' \|\| md5(current_database() \|\| '.pgboss.pgboss'))::bit(64)::bigint ); DELETE FROM pgboss.archive WHERE archivedOn < (now() - interval '7 days') ; COMMIT;

It appears as if the archive cleanup got executed several times, and due to some deadlock scenario, it could never been completed. Unfortunately I cannot repeat the scenario, so I have no clear steps to repeat. Nevertheless, it would be useful to understand how pgboss can drift into such a situation.

timgit commented 2 years ago

Do you have a large archive table?

I will look into adding more safeguards around transactional advisory locks

laurisvan commented 2 years ago

When checked, we had some 500k items in queue and 240k in archive. We've noticed before that such big queues do cause problem with pgboss (I suspect indexes aren't used properly) and it resolves to quite a lot of disk activity. Since we are running on AWS RDS, when we run out of burst quotas, the disk access will be extremely slow (some 2,4megabits per second, e.g. USB1 speed).

When checking this out further, this advisory lock is just the burst quota problem escalating - every pgboss access to disk was super slow, as the buffer cache nor indexes in memory were not used properly. This was just the most visible symptom to it.

If we somehow can make pgboss more index friendly, I suspect it will solve the problem.

timgit commented 2 years ago

The archive query does appear to have the correct index on the table. I think the best approach will be to add a limit of how many jobs we can archive at a time, combined with a transaction scoped query timeout for deadlock prevention.