Do you have a large archive table?
I will look into adding more safeguards around transactional advisory locks.
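As a sketch of the kind of safeguard I have in mind (not the actual pgboss implementation): the maintenance run could take the lock with the non-blocking try-variant and simply skip the run if another one already holds it, instead of queueing up behind it.

```sql
DO $$
BEGIN
  -- 42 is an illustrative lock key; pgboss derives its own key internally
  IF pg_try_advisory_xact_lock(42) THEN
    -- archive / cleanup work would run here
    RAISE NOTICE 'lock acquired, running maintenance';
  ELSE
    RAISE NOTICE 'another maintenance run holds the lock, skipping';
  END IF;
END
$$;
```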
When we checked, we had some 500k items in the queue and 240k in the archive. We've noticed before that such big queues do cause problems with pgboss (I suspect the indexes aren't used properly), and it results in quite a lot of disk activity. Since we are running on AWS RDS, once we run out of burst quota, disk access becomes extremely slow (some 2.4 megabits per second, i.e. USB 1 speed).
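For anyone wanting to check the same numbers on their own instance, something along these lines works (assuming the default `pgboss` schema and its `job`/`archive` tables):

```sql
-- Jobs per state in the active queue table
SELECT state, count(*) FROM pgboss.job GROUP BY state ORDER BY count(*) DESC;

-- Rows sitting in the archive
SELECT count(*) FROM pgboss.archive;

-- On-disk size of the pgboss tables, indexes included
SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_statio_user_tables
WHERE schemaname = 'pgboss';
```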
On further investigation, the advisory lock was just the burst quota problem escalating: every pgboss disk access was super slow, since neither the buffer cache nor the in-memory indexes were being used properly. The lock was just the most visible symptom of it.
If we can somehow make pgboss more index-friendly, I suspect that would solve the problem.
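One way to check whether that is actually the issue is to look at the scan statistics and the plan of a representative archive-candidate query. The predicate below is only an illustrative shape, not the exact statement pgboss issues:

```sql
-- Sequential vs. index scan counts on the pgboss tables
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE schemaname = 'pgboss';

-- Plan and buffer usage for an archive-candidate query (illustrative predicate)
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM pgboss.job
WHERE state = 'completed'
  AND completedon < now() - interval '12 hours'
LIMIT 1000;
```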
The archive query does appear to have the correct index on the table. I think the best approach will be to add a limit on how many jobs can be archived at a time, combined with a transaction-scoped query timeout for deadlock prevention.
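A rough sketch of that idea at the SQL level, assuming the archive step moves rows from `pgboss.job` to `pgboss.archive` and that the two tables share a column layout (the real implementation would list columns explicitly):

```sql
BEGIN;
-- Transaction-scoped timeout: give up and roll back rather than hold locks indefinitely
SET LOCAL statement_timeout = '30s';

-- Archive at most 10k jobs per maintenance run instead of the whole backlog
WITH moved AS (
  DELETE FROM pgboss.job
  WHERE id IN (
    SELECT id
    FROM pgboss.job
    WHERE state = 'completed'
      AND completedon < now() - interval '12 hours'
    LIMIT 10000
    FOR UPDATE SKIP LOCKED
  )
  RETURNING *
)
INSERT INTO pgboss.archive
SELECT * FROM moved;
COMMIT;
```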
Yesterday our database got completely stuck, with several connections waiting on a shared advisory lock in order to archive. Here's the statement that caused the lock:
It appears the archive cleanup got executed several times and, due to some deadlock scenario, could never complete. Unfortunately I cannot reproduce the scenario, so I have no clear steps to reproduce. Nevertheless, it would be useful to understand how pgboss can drift into such a situation.
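In case it helps anyone debugging a similar hang, the sessions stuck waiting on ungranted advisory locks (and the statements they are running) can be listed with the standard catalog views; nothing here is pgboss-specific:

```sql
-- Backends currently waiting on an advisory lock that has not been granted
SELECT a.pid,
       a.state,
       a.wait_event_type,
       a.wait_event,
       now() - a.query_start AS waiting_for,
       a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND NOT l.granted;
```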