Closed jsibbison-square closed 1 month ago
Thanks for starting this! It's a tricky feature.
I hope you don't mind but I've pushed 40ada94 as a test case to cover. It comes from the Snowflake MERGE docs. It provides good coverage with a mix of delete, update and insert branches with some conditionals:
MERGE INTO t1 USING t2 ON t1.t1Key = t2.t2Key
WHEN MATCHED AND t2.marked = 1 THEN DELETE
WHEN MATCHED AND t2.isNewStatus = 1 THEN UPDATE SET val = t2.newVal, status = t2.newStatus
WHEN MATCHED THEN UPDATE SET val = t2.newVal
WHEN NOT MATCHED THEN INSERT (t1Key, val, status) VALUES (t2.t2Key, t2.newVal, t2.newStatus);
It also demonstrate that MERGE needs to:
One way to solve these is to calculate the merge operations up front using the original table and store then as a MERGE_OP column in a temporary table. This ensures the branch conditions work with the state of the original table, not the table after it has already been modified by a DML statement run for a previous branch.
So something like this for the test case above:
CREATE OR REPLACE TEMPORARY TABLE merge_candidates AS
SELECT
t1.t1Key AS t1_t1Key,
t1.val AS t1_val,
t1.status AS t1_status,
t2.t2Key AS t2_t2Key,
t2.marked AS t2_marked,
t2.isNewStatus AS t2_isNewStatus,
t2.newVal AS t2_newVal,
t2.newStatus AS t2_newStatus,
CASE
WHEN t1.t1Key = t2.t2Key AND t2.marked = 1 THEN 1 -- DELETE
WHEN t1.t1Key = t2.t2Key and t2.isNewStatus = 1 THEN 2 -- UPDATE with conditional
WHEN t1.t1Key = t2.t2Key THEN 3 -- UPDATE the rest
WHEN t1.t1Key IS NULL THEN 4 -- INSERT remaining matches
ELSE 0 -- IGNORE
END AS MERGE_OP
FROM t1
FULL OUTER JOIN t2 ON t1.t1Key = t2.t2Key
WHERE MERGE_OP > 0;
This can then be used to drive the DML statements for each branch, and provide the result set, like this:
DELETE FROM t1
USING merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1key
AND mc.merge_op = 1;
UPDATE t1
SET val = mc.t2_newVal,
status = mc.t2_newStatus
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 2;
UPDATE t1
SET val = mc.t2_newVal
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 3;
INSERT INTO t1 (t1Key, val, status)
SELECT mc.t2_t2Key, mc.t2_newVal, mc.t2_newStatus
FROM merge_candidates AS mc
WHERE mc.merge_op = 4;
SELECT COUNT_IF(merge_op in (4)) AS "number of rows inserted", COUNT_IF(merge_op in (2,3)) AS "number of rows updated", COUNT_IF(merge_op in(1)) AS "number of rows deleted"
from merge_candidates;
Let me know what you think, and if this makes sense!
Thanks for the detailed feedback, seems merge is a lot more intricate than my usecase of upsert 😂 . I'm on holiday next week so it might be awhile before you see more progress on this. Thanks for the testcase.
Marked PR as WIP while I work through getting this functionality fully running. I've got the new test case running with a second attempt but I've got lots to clean up before its ready for review.
I'll mention you when its ready for review again.
How did you go @jsibbison-square? Do you need any help with this?
How did you go @jsibbison-square? Do you need any help with this?
Hi @tekumara sorry I've run short on time at the moment. This was working as far as I was concerned but I haven't had time to refactor as its a bit messy. Happy for it to be merged as is and refactor one day or you can take over if you want the feature sooner.
@tekumara I've now got a passing build, isolated merge logic into its own files and added some documentation so I can hope to recall how it works in the future. Please review 🙏
Thanks so much @jsibbison-square for this significant piece of work. I've pushed some refactoring to use pure functions rather than a class, as there's no permanent state to capture, and turned your comment documenting the sql output into a test.
I'm also considering whether it's worth refactoring this to have a single merge_candidates
table and statement to generate the merge operations as the comment above, rather than the two tables and statement per when
clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔
The main thing blocking merging this though, is it doesn't yet solve for a USING
clause with a SELECT
expression. See the example here. I'm not expecting you to solve this btw, I'm happy to noodle on it.
Thanks again @jsibbison-square - I've merged this and will address the above in a follow-up PR.
I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the https://github.com/tekumara/fakesnow/pull/109#issuecomment-2198181692, rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔
I did start with one table but then went with two, I don't recall exactly why sadly.
Thanks for merging in its partial support form, we have a lot of usecases that are the most basic of merges so it'll be great to start using.
Glad to hear you'll be using this soon. I've raised #136
This commit adds the ability to convert snowflakes MERGE INTO functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently.
This commit only adds the transform, there is more refactoring required in
fakes.py
in order to handle a transform that transforms a single expression into multiple expressions so it cannot be used yet. I have a change baking for adding it intofakes
but wanted to separate PRs.Let me know if you need more tests or anything as its quite a complicated feature.