isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central
0 stars 1 forks source link

PostGres Performance Problems with prometheus exporter query #281

Closed dannymandel closed 1 year ago

dannymandel commented 1 year ago

We've seen postgresql have a number of problems. It seems to boil down to a summary query running incredibly slowly:

isb_1=# explain analyze SELECT thing.authority_id AS authority, count(thing.authority_id) AS count FROM thing GROUP BY thing.authority_id;
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1669943.81..1669944.82 rows=4 width=15) (actual time=68650.372..68657.527 rows=5 loops=1)
   Group Key: authority_id
   ->  Gather Merge  (cost=1669943.81..1669944.74 rows=8 width=15) (actual time=68650.353..68657.496 rows=13 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=1668943.78..1668943.79 rows=4 width=15) (actual time=68620.539..68620.550 rows=4 loops=3)
               Sort Key: authority_id
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               Worker 1:  Sort Method: quicksort  Memory: 25kB
               ->  Partial HashAggregate  (cost=1668943.70..1668943.74 rows=4 width=15) (actual time=68620.482..68620.492 rows=4 loops=3)
                     Group Key: authority_id
                     Batches: 1  Memory Usage: 24kB
                     Worker 0:  Batches: 1  Memory Usage: 24kB
                     Worker 1:  Batches: 1  Memory Usage: 24kB
                     ->  Parallel Seq Scan on thing  (cost=0.00..1656472.47 rows=2494247 width=7) (actual time=3.910..66525.704 rows=2003018 loops=3)
 Planning Time: 194.661 ms
 JIT:
   Functions: 24
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 3.429 ms, Inlining 209.983 ms, Optimization 246.804 ms, Emission 159.546 ms, Total 619.763 ms
 Execution Time: 68698.541 ms
(22 rows)

That's ~69 sec for that query. The inner part of the loop is doing a sequential scan:

                     ->  Parallel Seq Scan on thing  (cost=0.00..1656472.47 rows=2494247 width=7) (actual time=3.910..66525.704 rows=2003018 loops=3)

If we convert it to doing it based on each authority, it uses the index, e.g.

isb_1=# explain analyze select count(*) from thing where authority_id='SESAR';
                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1615751.78..1615751.79 rows=1 width=8) (actual time=2463.386..2470.974 rows=1 loops=1)
   ->  Gather  (cost=1615751.56..1615751.77 rows=2 width=8) (actual time=2463.232..2470.958 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=1614751.56..1614751.57 rows=1 width=8) (actual time=2434.421..2434.425 rows=1 loops=3)
               ->  Parallel Index Only Scan using resolved_status_authority_id_idx on thing  (cost=0.43..1609921.66 rows=1931960 width=0) (actual time=8.436..1272.184 rows=1562795 loops=3)
                     Index Cond: (authority_id = 'SESAR'::text)
                     Heap Fetches: 82983
 Planning Time: 0.193 ms
 JIT:
   Functions: 11
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.610 ms, Inlining 142.466 ms, Optimization 41.098 ms, Emission 27.292 ms, Total 212.466 ms
 Execution Time: 2471.750 ms
(14 rows)

We shoud probably break it up by authority to avoid overtaxing the database. A short-term fix is to run the query once an hour instead of once a minute.

dannymandel commented 1 year ago

Addressed in https://github.com/isamplesorg/isamples_inabox/issues/283