Extremely Slow `check_available` in Basic Engine (also used for pro smart engine)

maennchen commented 3 months ago

Environment

Oban Version: 2.18.0
PostgreSQL Version: 15.1-1.pgdg20.04+1
Elixir & Erlang/OTP Versions (elixir --version): 1.17.2 on OTP 27.0.1

Current Behavior

After creating a very large number of jobs (10M+), I noticed a very noticeable slow down of the system. I found the following query in the slow query logs:

SELECT DISTINCT o0."queue" FROM "public"."oban_jobs" AS o0 WHERE (o0."state" = $1) AND (NOT (o0."queue" IS NULL))

Expected Behavior

The query should not block for a long time when there's lots of jobs.

Exploration

This query is executed by Oban.Engines.Basic.check_available/1.

The query has multiple issues:

queue can never be nil since the column is required
DISTINCT queries can't be indexed properly in postgres
- resulting in a bitmap heap scan (going through almost all rows)
- depending of how long ago the last vacuum happened, it's sometimes also a sequential scan of the whole table

Query Explain: https://explain.dalibo.com/plan/5ea6gaed026672ed

By checking the existence of the queues present in the oban_producers table or just specifying the queues to check if they are available, reduces this down to < 100ms even with millions of jobs.

CREATE INDEX CONCURRENTLY oban_jobs_active_queue_index
    ON oban_jobs (queue) WHERE state = 'available';
CREATE INDEX CONCURRENTLY  oban_producers_queue_index
    ON oban_producers(queue);
SELECT
    oban_producers.queue
    FROM oban_producers
    WHERE EXISTS(
        SELECT
            1
            FROM oban_jobs
            WHERE
                oban_jobs.state = 'available' AND
                oban_jobs.queue = oban_producers.queue
    );

https://explain.dalibo.com/plan/da8e8a68g3gad869

The query is still far from optimal, the index scan on the jobs table is massively over estimating the amount of rows and could still be improved considerable, but is 2 orders of magnitude quicker.

sorentwo commented 3 months ago

We're ahead of you on this one 🙂

Pro v1.5 adds a custom check_available/1 implementation that is vastly faster (< 8ms for 9.6 million jobs). There's a description of the change and a performance comparison table in the CHANGELOG: https://getoban.pro/docs/pro/1.5.0-rc.0/changelog.html#enhancements

queue can never be nil since the column is required

The use of is not null was to force index usage in older versions of Oban where the compound index had queue before state. It's not necessary now, but there are Oban installations out there that still use the old index because it was an optional migration.

maennchen commented 3 months ago

@sorentwo Amazing, will update to the RC immediately then :)

maennchen commented 3 months ago

@sorentwo Works perfectly.

Before fix applied: Screenshot from 2024-08-15 15-33-06

After fix applied: Screenshot from 2024-08-15 17-16-31

oban-bg / oban