mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License
2.07k stars 100 forks source link

Speedup incremental insert into partitioned tables #21

Closed jankatins closed 4 years ago

jankatins commented 4 years ago

When having a partitioned table where incremental load is inserted into, this speeds up the insert by making scanning the dst table parallel

Comparing the following queries (all resulting in zero results as the insert already happened):

-- Uses Append on a Seq Scan on all children with a Hash Right Join
Explain Analyse
SELECT src.*
FROM os_data.whatever_upsert src
LEFT JOIN os_data.whatever dst
  ON dst."id" = src."id"
WHERE dst.* IS NULL
-- Execution Time: 58464.012 ms
-- Uses Parallel Append on a Parallel Seq Scan with a Parallel Hash Anti Join
Explain Analyse
SELECT src.*
FROM os_data.whatever_upsert src
LEFT JOIN os_data.whatever dst
  ON dst."id" = src."id"
WHERE dst.id IS NULL
-- Execution Time: 14804.001 ms
-- Again Uses Parallel Append on a Parallel Seq Scan with a Parallel Hash Anti Join
-- when I developed this with less upsert data it used the Index in the id column
Explain Analyse
SELECT src.*
FROM os_data.job_match_2019_upsert src
WHERE NOT EXISTS (SELECT 1 FROM os_data.job_match_2019 dst WHERE dst."id" = src."id")
-- Execution Time: 14465.946 ms

As I currently see no performance difference between the last two and the last one once used index, I kept that version.

Tested on Pg11, Mac

apisarenco commented 4 years ago
WHERE NOT EXISTS (SELECT 1 FROM os_data.job_match_2019 dst WHERE dst."id" = src."id")

I'd test this in more cases, because we've resorted to not use such constructions because the planner would sometimes choose to do a nested loop (older PG?).

martin-loetzsch commented 4 years ago

I tested it in two projects (PG 11 & 12) and it worked (couldn't see how it's faster, didn't have patience to wait for new data for upsert)