rubysherpas / paranoia

acts_as_paranoid for Rails 5, 6 and 7

query performance: need help optimizing #495

Open spiicychknsandy opened 4 years ago

spiicychknsandy commented 4 years ago

I know this isn't really an issue with this repo, but I've looked in a lot of places and tried a lot of different things, and nothing is working.

I need to bulk-update deleted_at for a table with 3M rows, and to decide which records qualify I need to check two other tables that each have 5-10M records.

My query is:

UPDATE public.study_data_events sde
SET deleted_at = current_timestamp
WHERE sde.id
IN (
  SELECT s.id
  FROM public.study_data_events s
  LEFT JOIN public.study_data_values sdv on sdv.study_data_event_id = s.id and sdv.deleted_at is NULL
  LEFT JOIN public.intervention_phase_events ipe on ipe.intervention_event_id = s.id and ipe.deleted_at is NULL
  WHERE sdv.id is NULL
  AND ipe.id is NULL
  AND s.deleted_at is NULL
);

The EXPLAIN output looks like:

Update on study_data_events sde  (cost=48822.30..48822.62 rows=1 width=132)
   ->  Nested Loop  (cost=48822.30..48822.62 rows=1 width=132)
         ->  HashAggregate  (cost=48821.87..48821.88 rows=1 width=22)
               Group Key: s.id
               ->  Nested Loop Left Join  (cost=36390.15..48821.87 rows=1 width=22)
                     Filter: (sdv.id IS NULL)
                     ->  Hash Right Join  (cost=36389.71..48817.69 rows=1 width=16)
                           Hash Cond: (ipe.intervention_event_id = s.id)
                           Filter: (ipe.id IS NULL)
                           ->  Seq Scan on intervention_phase_events ipe  (cost=0.00..5637.57 rows=270448 width=14)
                                 Filter: (deleted_at IS NULL)
                           ->  Hash  (cost=29704.09..29704.09 rows=384610 width=10)
                                 ->  Seq Scan on study_data_events s  (cost=0.00..29704.09 rows=384610 width=10)
                                       Filter: (deleted_at IS NULL)
                     ->  Index Scan using study_data_values_study_data_event_id_idx on study_data_values sdv  (cost=0.43..3.31 rows=87 width=14)
                           Index Cond: (study_data_event_id = s.id)
                           Filter: (deleted_at IS NULL)
         ->  Index Scan using study_data_events_pkey on study_data_events sde  (cost=0.42..0.72 rows=1 width=106)
               Index Cond: (id = s.id)

We have this seq scan:

->  Seq Scan on study_data_events s  (cost=0.00..29704.09 rows=384610 width=10)
    Filter: (deleted_at IS NULL)

It is very costly, and I'm wondering what I can do about it.
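One rewrite worth measuring is the NOT EXISTS form of the same filter, which Postgres can plan as an anti-join and which drops the extra self-join through the IN subquery. This is a sketch, untested against this schema; it assumes sdv.id and ipe.id are non-null primary keys, so the LEFT JOIN ... IS NULL tests above can only mean "no matching live row":

UPDATE public.study_data_events sde
SET deleted_at = current_timestamp
WHERE sde.deleted_at IS NULL
AND NOT EXISTS (
  -- no live study_data_values child
  SELECT 1
  FROM public.study_data_values sdv
  WHERE sdv.study_data_event_id = sde.id
  AND sdv.deleted_at IS NULL
)
AND NOT EXISTS (
  -- no live intervention_phase_events child
  SELECT 1
  FROM public.intervention_phase_events ipe
  WHERE ipe.intervention_event_id = sde.id
  AND ipe.deleted_at IS NULL
);

If a single transaction over 3M rows is too heavy, the same shape can be batched by moving the filter back into an id subquery with a LIMIT and looping until no rows are updated.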

I've tried adding a variety of indices on s.deleted_at, none of which have helped. Has anyone needed to filter a large dataset on deleted_at IS NULL?
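For reference, the usual Postgres pattern for this is a partial index covering only live rows, rather than a plain index on deleted_at, plus matching partial indexes on the child tables' foreign-key columns (the plan above seq scans intervention_phase_events as well). A sketch; the index names are made up:

-- hypothetical index names; each partial index covers only rows
-- WHERE deleted_at IS NULL, so it stays small and matches the query's filter
CREATE INDEX CONCURRENTLY study_data_events_live_idx
  ON public.study_data_events (id)
  WHERE deleted_at IS NULL;

CREATE INDEX CONCURRENTLY study_data_values_live_event_idx
  ON public.study_data_values (study_data_event_id)
  WHERE deleted_at IS NULL;

CREATE INDEX CONCURRENTLY intervention_phase_events_live_event_idx
  ON public.intervention_phase_events (intervention_event_id)
  WHERE deleted_at IS NULL;

That said, the plan estimates 384,610 live rows in study_data_events, and the query needs all of them; when that large a share of a table matches the predicate, the planner will often still choose a seq scan because reading the heap sequentially is cheaper than an index scan, so an index on deleted_at may legitimately never be used here.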

rishi-jisr commented 2 years ago

@spiicychknsandy Did you find any solution for this?