timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License
2.09k stars 158 forks source link

Deadlock when archiving jobs #282

Closed jorgemarsal closed 2 years ago

jorgemarsal commented 2 years ago

I'm seeing these deadlock errors on the Postgres logs (version 10.17, AWS RDS):

2021-10-16 18:09:32 UTC:172.31.9.74(60264):postgres@ocr:[12645]:ERROR:  deadlock detected
2021-10-16 18:09:32 UTC:172.31.9.74(60264):postgres@ocr:[12645]:DETAIL:  Process 12645 waits for ShareLock on transaction 17119874; blocked by process 10732.
    Process 10732 waits for ShareLock on transaction 17119873; blocked by process 12645.
    Process 12645: 
        WITH archived_rows AS (
          DELETE FROM pgboss.job
          WHERE
            completedOn < (now() - CAST($1 as interval))
            OR (
              state = 'created'
              AND name LIKE '__state__completed__%'
              AND createdOn < (now() - CAST($1 as interval))
            )
          RETURNING *
        )
        INSERT INTO pgboss.archive (
          id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
        )
        SELECT 
          id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
        FROM archived_rows

    Process 10732: 
        WITH archived_rows AS (
          DELETE FROM pgboss.job
          WHERE
            completedOn < (now() - CAST($1 as interval))
            OR (
              state = 'created'
              AND name LIKE '__state__completed__%'
              AND createdOn < (now() - CAST($1 as interval))
            )
          RETURNING *
        )
        INSERT INTO pgboss.archive (
          id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
        )
        SELECT 
          id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
        FROM archived_rows

2021-10-16 18:09:32 UTC:172.31.9.74(60264):postgres@ocr:[12645]:HINT:  See server log for query details.
2021-10-16 18:09:32 UTC:172.31.9.74(60264):postgres@ocr:[12645]:CONTEXT:  while deleting tuple (55,17) in relation "job"
2021-10-16 18:09:32 UTC:172.31.9.74(60264):postgres@ocr:[12645]:STATEMENT:  
        WITH archived_rows AS (
          DELETE FROM pgboss.job
          WHERE
            completedOn < (now() - CAST($1 as interval))
            OR (
              state = 'created'
              AND name LIKE '__state__completed__%'
              AND createdOn < (now() - CAST($1 as interval))
            )
          RETURNING *
        )
        INSERT INTO pgboss.archive (
          id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
        )
        SELECT 
          id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
        FROM archived_rows
timgit commented 2 years ago

I've seen deadlocks from maintenance operations before, but only when the job table is very large or the maintenance intervals are set to run too often. Can you include your configuration properties and an approx record count in pgboss.job?

jorgemarsal commented 2 years ago

Thank you for the quick response @timgit !

Counts:

select count(*) from pgboss.job j ;
-- 15119
select count(*) from pgboss.archive a ;
-- 60478

Config:

expireIn: 5 minutes
team size: 1
team concurrency: 1
retryLimit: 1
newJobCheckIntervalSeconds: 1
expireCheckIntervalSeconds: 1

Let me know if you need any other information.

timgit commented 2 years ago

What version of pg-boss are you running? There is no expireCheckIntervalSeconds config property in the latest version

jorgemarsal commented 2 years ago

I was running an old version. I upgraded to the latest 6.2.2. I'll keep an eye to see if the errors are gone and report back. Thanks @timgit !