mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License
2.07k stars 100 forks source link

Add explicit upsert (update + insert commands) #11

Closed apisarenco closed 5 years ago

apisarenco commented 5 years ago

ON CONFLICT DO UPDATE syntax requires mentioning the unique key that is present in the target table, for it to work. This makes upserting, using this method, into partitioned tables impossible in Postgresql 10, and only possible in Postgresql 11 if it's partitioned on a scalar value (not expression). This makes it difficult to upsert into a table partitioned by a chunk, with the chunking function present only in the DWH.

For example:

CREATE TABLE foo (
    id1,
    id2
) PARTITION BY (compute_chunk(id2));
CREATE TABLE foo_1 PARTITION OF foo FOR VALUES IN (1);

Will not support this style of partitioning, in Postgresql 11.

This is solved by this commit, that adds the option to do, instead of this:

INSERT INTO foo
SELECT * FROM foo_upsert
ON CONFLICT (id1) DO UPDATE SET id1=EXCLUDED.id1, id2=EXCLUDED.id2;

To do this:

UPDATE foo dst
SET id1=src.id1, id2=src.id2
FROM foo_upsert src
WHERE src.id1=dst.id1;

INSERT INTO foo
SELECT src.*
FROM foo_upsert src
-- Basically a memory-efficient "NOT IN (SELECT whatever)" query
LEFT JOIN foo dst
  ON src.id1=dst.id1
WHERE dst.* IS NULL;

This does not require the presence of any keys at all, but if keys are present, the performance is the same or better.

By default, this behavior is disabled, but can be enabled in the pipeline definition of the incremental load procedure with use_explicit_upsert=True constructor argument