timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss

Slow queries from pg-boss #126

Closed mshwery closed 4 years ago

mshwery commented 5 years ago

We have an archive retention policy of 30 days before we evict rows from the archive table, but it seems this produces very poor query times as the number of archived jobs increases...

Should pgboss create an index on archivedOn, since it heavily depends on that column for its eviction query?

timgit commented 5 years ago

Would you share the number of records in the archive and how long it's taking? I think it's fine to add an index, and it would be great if you could add one and show the perf diff.
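
For reference, a minimal sketch of what adding that index could look like (the IDX_archivedon name matches the one that shows up in a query plan later in this thread; the exact DDL is an assumption, not pg-boss's own migration):

-- hypothetical DDL, not pg-boss's actual migration
CREATE INDEX "IDX_archivedon" ON pgboss.archive (archivedon);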

mshwery commented 5 years ago

Here are a few examples of slow query logs we are seeing. All are related to the archiving or eviction queries:

27 seconds

{
  "data": {
    "parameters": [
      "1 hour"
    ],
    "query": "\n    WITH archived_rows AS (\n      DELETE FROM pgboss.job\n      WHERE\n        (completedOn + CAST($1 as interval) < now())\n        OR (\n          state = 'created'\n          AND name LIKE '__state__completed__%'\n          AND createdOn + CAST($1 as interval) < now()\n        )\n      RETURNING *\n    )\n    INSERT INTO pgboss.archive (\n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    )\n    SELECT \n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    FROM archived_rows\n  ",
    "time": 27139
  },
  "message": "Slow query execution time: 27139"
}

219 seconds

{
  "data": {
    "parameters": [
      "30 days"
    ],
    "query": "\n    DELETE FROM pgboss.archive\n    WHERE (archivedOn + CAST($1 as interval) < now())\n  ",
    "time": 219570
  },
  "message": "Slow query execution time: 219570"
}

145 seconds

{
  "data": {
    "parameters": [
      "30 days"
    ],
    "query": "\n    DELETE FROM pgboss.archive\n    WHERE (archivedOn + CAST($1 as interval) < now())\n  ",
    "time": 145607
  },
  "message": "Slow query execution time: 145607"
}

50 seconds

{
  "data": {
    "parameters": [
      "1 hour"
    ],
    "query": "\n    WITH archived_rows AS (\n      DELETE FROM pgboss.job\n      WHERE\n        (completedOn + CAST($1 as interval) < now())\n        OR (\n          state = 'created'\n          AND name LIKE '__state__completed__%'\n          AND createdOn + CAST($1 as interval) < now()\n        )\n      RETURNING *\n    )\n    INSERT INTO pgboss.archive (\n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    )\n    SELECT \n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    FROM archived_rows\n  ",
    "time": 50476
  },
  "message": "Slow query execution time: 50476"
}

There are many of these in the logs each day. I can get you more details later on how many records, etc., and the result of adding an index on the archive table.
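
For readability, the two queries embedded in those logs, unescaped ($1 is the retention interval parameter):

WITH archived_rows AS (
  DELETE FROM pgboss.job
  WHERE
    (completedOn + CAST($1 as interval) < now())
    OR (
      state = 'created'
      AND name LIKE '__state__completed__%'
      AND createdOn + CAST($1 as interval) < now()
    )
  RETURNING *
)
INSERT INTO pgboss.archive (
  id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
)
SELECT
  id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn
FROM archived_rows

DELETE FROM pgboss.archive
WHERE (archivedOn + CAST($1 as interval) < now())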

mshwery commented 5 years ago

~24M rows at the moment (about 10x more than expected), though I see this is in part because pgboss also keeps a row representing each job's state transition, like __state__completed__FOO_JOB.

mshwery commented 4 years ago

@timgit another thought – would it be reasonable to allow individual queues to define their own archive retention? I realize that might be a big shift in the inner workings of the lib, but in our case we have some jobs that we want to keep around for a while, and others that can be evicted within a day.

mshwery commented 4 years ago

Looks like the index can't help because of how the comparison is written. We need to isolate the indexed column on the left-hand side of the comparison:

Before:

WHERE (archivedOn + CAST($1 as interval) < now())

After:

WHERE (archivedOn < now() - CAST($1 as interval))
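
In other words, the predicate needs to be sargable. A quick sketch of the difference, assuming the IDX_archivedon index exists and substituting a '30 days' literal for the $1 parameter:

-- not sargable: the indexed column is wrapped in an expression,
-- so the planner can't use IDX_archivedon
EXPLAIN DELETE FROM pgboss.archive
WHERE (archivedOn + CAST('30 days' as interval) < now());

-- sargable: bare column on the left, constant expression on the right
EXPLAIN DELETE FROM pgboss.archive
WHERE (archivedOn < now() - CAST('30 days' as interval));
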
mshwery commented 4 years ago

With a small enough table this doesn't seem to help either – it still does a Seq Scan, because the predicate covers an open-ended range (everything older than now() - interval, back to the beginning of time).

I was able to get the index used by rewriting the predicate with BETWEEN to create a bounded range instead, but it's not foolproof, because we don't know the oldest archivedOn value in the table. And if the range is sufficiently large compared to the size of the table, the planner may still skip the index:

BEGIN;
EXPLAIN ANALYZE DELETE FROM pgboss.archive WHERE archivedOn BETWEEN (now() - CAST('2 days' as interval)) AND (now() - CAST('1 day' as interval));
ROLLBACK;
Delete on archive  (cost=14.17..1346.46 rows=573 width=6) (actual time=5.882..5.882 rows=0 loops=1)
  ->  Bitmap Heap Scan on archive  (cost=14.17..1346.46 rows=573 width=6) (actual time=0.123..0.530 rows=570 loops=1)
        Recheck Cond: ((archivedon >= (now() - '2 days'::interval)) AND (archivedon <= (now() - '1 day'::interval)))
        Heap Blocks: exact=136
        ->  Bitmap Index Scan on "IDX_archivedon"  (cost=0.00..14.03 rows=573 width=0) (actual time=0.098..0.098 rows=570 loops=1)
              Index Cond: ((archivedon >= (now() - '2 days'::interval)) AND (archivedon <= (now() - '1 day'::interval)))
Planning time: 0.088 ms
Execution time: 5.917 ms

timgit commented 4 years ago

Thanks for looking into this and playing around with the options. There are a lot of options I could use to address this, some fairly advanced, such as partitioning the archive by date so that evicting old jobs becomes dropping a partition instead of running a costly DELETE.
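
For illustration, a rough sketch of what date-based eviction could look like with declarative partitioning (a hypothetical, simplified schema, not pg-boss's actual tables):

-- hypothetical, simplified archive table partitioned by archive date
CREATE TABLE archive_by_month (
  id uuid NOT NULL,
  name text NOT NULL,
  data jsonb,
  archivedon timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (archivedon);

CREATE TABLE archive_2019_10 PARTITION OF archive_by_month
  FOR VALUES FROM ('2019-10-01') TO ('2019-11-01');

CREATE TABLE archive_2019_11 PARTITION OF archive_by_month
  FOR VALUES FROM ('2019-11-01') TO ('2019-12-01');

-- eviction becomes a cheap metadata operation, no row-by-row DELETE
DROP TABLE archive_2019_10;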

Until that happens, I'm thinking the best course of action here would be to create your own retention system that pulls the jobs you need out of the archive, so you won't have to rely on it for anything of importance. If you had your own process copying the meaningful jobs out of the archive on an interval shorter than the archive retention, you could then reduce the retention period low enough that the execution time issue is resolved.
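
For illustration, a sketch of such a user-side retention job, run on a schedule (the job_history table and 'important-job' name are hypothetical):

-- hypothetical long-term history table mirroring the archive's columns
CREATE TABLE IF NOT EXISTS job_history (LIKE pgboss.archive INCLUDING ALL);

-- copy only the jobs worth keeping, skipping rows already copied
INSERT INTO job_history
SELECT a.*
FROM pgboss.archive a
WHERE a.name = 'important-job'
  AND NOT EXISTS (SELECT 1 FROM job_history h WHERE h.id = a.id);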

timgit commented 4 years ago

I added an index on archivedOn in 3.2.1, which should help out in most cases, although I agree it's not a perfect solution for very large volumes.