tekumara / fakesnow

Fake Snowflake Connector for Python. Run, mock and test Snowflake DB locally.
Apache License 2.0
108 stars 10 forks source link

feat: Adds MERGE INTO transform #109

Closed jsibbison-square closed 1 month ago

jsibbison-square commented 4 months ago

This commit adds the ability to convert snowflakes MERGE INTO functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently.

This commit only adds the transform, there is more refactoring required in fakes.py in order to handle a transform that transforms a single expression into multiple expressions so it cannot be used yet. I have a change baking for adding it into fakes but wanted to separate PRs.

Let me know if you need more tests or anything as its quite a complicated feature.

tekumara commented 4 months ago

Thanks for starting this! It's a tricky feature.

I hope you don't mind but I've pushed 40ada94 as a test case to cover. It comes from the Snowflake MERGE docs. It provides good coverage with a mix of delete, update and insert branches with some conditionals:

MERGE INTO t1 USING t2 ON t1.t1Key = t2.t2Key
    WHEN MATCHED AND t2.marked = 1 THEN DELETE
    WHEN MATCHED AND t2.isNewStatus = 1 THEN UPDATE SET val = t2.newVal, status = t2.newStatus
    WHEN MATCHED THEN UPDATE SET val = t2.newVal
    WHEN NOT MATCHED THEN INSERT (t1Key, val, status) VALUES (t2.t2Key, t2.newVal, t2.newStatus);

It also demonstrate that MERGE needs to:

  1. return a result set like this
  2. have branch conditions that are isolated and don't interact. Specifically in this example the last INSERT branch should not re-add rows removed by the DELETE branch.

One way to solve these is to calculate the merge operations up front using the original table and store then as a MERGE_OP column in a temporary table. This ensures the branch conditions work with the state of the original table, not the table after it has already been modified by a DML statement run for a previous branch.

So something like this for the test case above:

CREATE OR REPLACE TEMPORARY TABLE merge_candidates AS
SELECT
    t1.t1Key AS t1_t1Key,
    t1.val AS t1_val,
    t1.status AS t1_status,
    t2.t2Key AS t2_t2Key,
    t2.marked AS t2_marked,
    t2.isNewStatus AS t2_isNewStatus,
    t2.newVal AS t2_newVal,
    t2.newStatus AS t2_newStatus,
    CASE
        WHEN t1.t1Key = t2.t2Key AND t2.marked = 1 THEN 1           -- DELETE
        WHEN t1.t1Key = t2.t2Key and t2.isNewStatus = 1 THEN 2      -- UPDATE with conditional
        WHEN t1.t1Key = t2.t2Key THEN 3                             -- UPDATE the rest
        WHEN t1.t1Key IS NULL THEN 4                                -- INSERT remaining matches
        ELSE 0                                                      -- IGNORE
    END AS MERGE_OP
    FROM t1
FULL OUTER JOIN t2 ON t1.t1Key = t2.t2Key
WHERE MERGE_OP > 0;

This can then be used to drive the DML statements for each branch, and provide the result set, like this:

DELETE FROM t1
USING merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1key
  AND mc.merge_op = 1;

UPDATE t1
SET val = mc.t2_newVal,
    status = mc.t2_newStatus
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 2;

UPDATE t1
SET val = mc.t2_newVal
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 3;

INSERT INTO t1 (t1Key, val, status)
SELECT mc.t2_t2Key, mc.t2_newVal, mc.t2_newStatus
FROM merge_candidates AS mc
WHERE mc.merge_op = 4;

SELECT COUNT_IF(merge_op in (4)) AS "number of rows inserted", COUNT_IF(merge_op in (2,3)) AS "number of rows updated", COUNT_IF(merge_op in(1)) AS "number of rows deleted"
from merge_candidates;

Let me know what you think, and if this makes sense!

jsibbison-square commented 4 months ago

Thanks for the detailed feedback, seems merge is a lot more intricate than my usecase of upsert 😂 . I'm on holiday next week so it might be awhile before you see more progress on this. Thanks for the testcase.

jsibbison-square commented 3 months ago

Marked PR as WIP while I work through getting this functionality fully running. I've got the new test case running with a second attempt but I've got lots to clean up before its ready for review.

I'll mention you when its ready for review again.

tekumara commented 2 months ago

How did you go @jsibbison-square? Do you need any help with this?

jazzarati commented 2 months ago

How did you go @jsibbison-square? Do you need any help with this?

Hi @tekumara sorry I've run short on time at the moment. This was working as far as I was concerned but I haven't had time to refactor as its a bit messy. Happy for it to be merged as is and refactor one day or you can take over if you want the feature sooner.

jsibbison-square commented 2 months ago

@tekumara I've now got a passing build, isolated merge logic into its own files and added some documentation so I can hope to recall how it works in the future. Please review 🙏

tekumara commented 2 months ago

Thanks so much @jsibbison-square for this significant piece of work. I've pushed some refactoring to use pure functions rather than a class, as there's no permanent state to capture, and turned your comment documenting the sql output into a test.

I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the comment above, rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔

The main thing blocking merging this though, is it doesn't yet solve for a USING clause with a SELECT expression. See the example here. I'm not expecting you to solve this btw, I'm happy to noodle on it.

tekumara commented 1 month ago

Thanks again @jsibbison-square - I've merged this and will address the above in a follow-up PR.

jazzarati commented 1 month ago

I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the https://github.com/tekumara/fakesnow/pull/109#issuecomment-2198181692, rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔

I did start with one table but then went with two, I don't recall exactly why sadly.

Thanks for merging in its partial support form, we have a lot of usecases that are the most basic of merges so it'll be great to start using.

tekumara commented 1 month ago

Glad to hear you'll be using this soon. I've raised #136