malthe / pq

A PostgreSQL job queueing system
376 stars 41 forks source link

Performance with a very large queue #60

Closed pramsey closed 4 years ago

pramsey commented 4 years ago

Expected Behavior

Expect ms-level times for retrieving the next entry off the queue.

Actual Behavior

Times of several full seconds as the queue size exceeds a couple million.

Steps to Reproduce the Problem

Only potential quirk is this is against PgSQL 9.6. Version issues?

  1. Fill the queue.
  2. Attempt to pull records off the queue
  3. Time them

The bottleneck appeared to be the sort stage, which was not index assisted in the plans. As the number of items on the queue grew the sort took up more and more time.

https://github.com/malthe/pq/blob/master/pq/__init__.py#L308

Specifications

malthe commented 4 years ago

Do you have an EXPLAIN ANALYZE that might help determine the needed index?

pramsey commented 4 years ago
thematrix=> explain analyze SELECT * FROM thematrix.queue WHERE q_name = 'matrix' AND dequeued_at IS NULL ORDER BY schedule_at nulls first, expected_at nulls last, id LIMIT 1;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=221449.70..221449.70 rows=1 width=646) (actual time=560.917..560.918 rows=1 loops=1)
   ->  Sort  (cost=221449.70..222840.64 rows=556377 width=646) (actual time=560.917..560.917 rows=1 loops=1)
         Sort Key: schedule_at NULLS FIRST, expected_at, id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Index Scan using priority_idx_no_queue on queue  (cost=0.42..218667.81 rows=556377 width=646) (actual time=0.182..388.776 rows=595240 loops=1)
               Index Cond: (q_name = 'matrix'::text)
 Planning time: 0.179 ms
 Execution time: 560.955 ms
(8 rows)

I did built a partial index with 'matrix' as the driving name instead of 'queue' but it didn't change the performance at all, it was exactly the same even using the other index.

malthe commented 4 years ago

So it is supposed to use the index defined here:

create index if not exists priority_idx_%(name)s on %(name)s
      (schedule_at nulls first, expected_at nulls last, q_name)
    where dequeued_at is null
          and q_name = '%(name)s';

But perhaps you don't have that index defined.

@3manuek – I don't really understand why there couldn't just be a single index (or two indexes if you count the other index as well) rather than one per value for q_name, -re- https://github.com/malthe/pq/commit/bc835d7bb8b33d145c4d4018ec462ae778e9ce42.

Seems to me like the right index here would be:

create index priority_idx on %(name)s
    (q_name, schedule_at nulls first, expected_at nulls first)
    where dequeued_at is null;

What's the reason for the priority_idx_no_ index actually?

pramsey commented 4 years ago

I did ensure I have a specific index for my queue name (though you have one for generic 'queue' and one for "everything else" and that seems like it should work fine too) and got the plan above. New indexes, re-indexing, analyzing, vacuuming... nothing moved the needle.

And yeah, it seems clear that the index should come into play, particularly for a SORT ... LIMIT 1 scenario, but it refused to, no matter what knobs I twiddle. I'm not sure if I should blame 9.6 or RDS, but something is awry.

malthe commented 4 years ago

Can you paste in the table definition including the indexe(s) you have on there? Just so that there's nothing crazy hiding there.

pramsey commented 4 years ago
                                   Table "thematrix.queue"
   Column    |           Type           |                     Modifiers                      
-------------+--------------------------+----------------------------------------------------
 id          | bigint                   | not null default nextval('queue_id_seq'::regclass)
 enqueued_at | timestamp with time zone | not null default now()
 dequeued_at | timestamp with time zone | 
 expected_at | timestamp with time zone | 
 schedule_at | timestamp with time zone | 
 q_name      | text                     | not null
 data        | json                     | not null
Indexes:
    "queue_pkey" PRIMARY KEY, btree (id)
    "priority_idx_matrix" btree (schedule_at NULLS FIRST, expected_at NULLS FIRST, q_name) WHERE dequeued_at IS NULL AND q_name = 'matrix'::text
    "priority_idx_no_queue" btree (schedule_at NULLS FIRST, expected_at NULLS FIRST, q_name) WHERE dequeued_at IS NULL AND q_name <> 'queue'::text
    "priority_idx_queue" btree (schedule_at NULLS FIRST, expected_at NULLS FIRST, q_name) WHERE dequeued_at IS NULL AND q_name = 'queue'::text
Check constraints:
    "queue_q_name_check" CHECK (length(q_name) > 0)
Triggers:
    pq_insert AFTER INSERT ON queue FOR EACH ROW EXECUTE PROCEDURE pq_notify()
malthe commented 4 years ago

Looks like your indexes are wrong: you have: expected_at NULLS FIRST rather than LAST.

pramsey commented 4 years ago

Did the pq code change at some point? Because these indexes were created by pq originally. That was four years ago now, so maybe there was a change in the interim.

malthe commented 4 years ago

Yeah it changed here: d0602bd7e73021a4f722f9fc2007c6324df38fd5.