timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License
2.15k stars 160 forks source link

Duplicate key value violates unique constraint: job_singleton_queue #522

Open tkambler opened 1 week ago

tkambler commented 1 week ago

Could you help me understand what could possibly be causing this error to be emitted by PGBoss?

{
  "message": "duplicate key value violates unique constraint \"job_singleton_queue\"",
  "name": "error",
  "stack": "error: duplicate key value violates unique constraint \"job_singleton_queue\"\n    at /app/node_modules/pg/node_modules/pg-pool/index.js:45:11\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at Db.executeSql (/app/node_modules/pg-boss/src/db.js:28:14)\n    at Boss.executeSql (/app/node_modules/pg-boss/src/boss.js:249:14)\n    at Boss.expire (/app/node_modules/pg-boss/src/boss.js:216:5)\n    at Boss.onMaintenance (/app/node_modules/pg-boss/src/boss.js:132:7)\n    at resolveWithinSeconds (/app/node_modules/pg-boss/src/manager.js:35:14)\n    at /app/node_modules/pg-boss/node_modules/p-map/index.js:57:22",
  "code": "23505"
}
tkambler commented 1 week ago

From what I gather, here is what's happening:

executeSql
    BEGIN;
    SET LOCAL statement_timeout = '30s';
    SELECT pg_advisory_xact_lock(
      ('x' || md5(current_database() || '.pgboss.pgboss'))::bit(64)::bigint
  );

    WITH results AS (
      UPDATE pgboss.job
      SET state = CASE
          WHEN retryCount < retryLimit THEN 'retry'::pgboss.job_state
          ELSE 'expired'::pgboss.job_state
          END,
        completedOn = CASE
          WHEN retryCount < retryLimit
          THEN NULL
          ELSE now()
          END,
        startAfter = CASE
          WHEN retryCount = retryLimit THEN startAfter
          WHEN NOT retryBackoff THEN now() + retryDelay * interval '1'
          ELSE now() +
            (
                retryDelay * 2 ^ LEAST(16, retryCount + 1) / 2
                +
                retryDelay * 2 ^ LEAST(16, retryCount + 1) / 2 * random()
            )
            * interval '1'
          END
      WHERE state = 'active'
        AND (startedOn + expireIn) < now()
      RETURNING *
    )
    INSERT INTO pgboss.job (name, data, keepUntil)
    SELECT
      '__state__completed__' || name,
      jsonb_build_object(
    'request', jsonb_build_object('id', id, 'name', name, 'data', data),
    'response', null,
    'state', state,
    'retryCount', retryCount,
    'createdOn', createdOn,
    'startedOn', startedOn,
    'completedOn', completedOn,
    'failed', CASE WHEN state = 'completed' THEN false ELSE true END
  ),
      keepUntil + (keepUntil - startAfter)
    FROM results
    WHERE state = 'expired'
      AND NOT name LIKE '__state__completed__%'
      AND on_complete
  ;
    COMMIT;

This results in the following SQL error:

{
  "message": "duplicate key value violates unique constraint \"job_singleton_queue\"",
  "name": "error",
  "stack": "error: duplicate key value violates unique constraint \"job_singleton_queue\"\n    at /app/node_modules/pg/node_modules/pg-pool/index.js:45:11\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at Db.executeSql (/app/node_modules/pg-boss/src/db.js:28:14)\n    at Boss.executeSql (/app/node_modules/pg-boss/src/boss.js:249:14)\n    at Boss.expire (/app/node_modules/pg-boss/src/boss.js:216:5)\n    at Boss.onMaintenance (/app/node_modules/pg-boss/src/boss.js:132:7)\n    at resolveWithinSeconds (/app/node_modules/pg-boss/src/manager.js:35:14)\n    at /app/node_modules/pg-boss/node_modules/p-map/index.js:57:22",
  "code": "23505"
}

The reason being that this violates the following constraint:

CREATE UNIQUE INDEX IF NOT EXISTS job_singleton_queue ON pgboss.job (name, singletonKey) WHERE state < 'active' AND singletonOn IS NULL AND singletonKey LIKE '\_\_pgboss\_\_singleton\_queue%';

In other words, it can't assign retry status to the job because more than two jobs in a singleton queue cannot exist that have state < active (i.e. created, retry).

Have we done something wrong to surface this problem, or is this a bug inherent in the implementation?