snowflakedb / snowflake-kafka-connector

Snowflake Kafka Connector (Sink Connector)
Apache License 2.0

UPSERT Support #505

Open miramar-labs opened 1 year ago

miramar-labs commented 1 year ago

I'm using the JDBC Source connector to bring an MSSQL table into Kafka as a topic. I have two sink connectors configured on that topic: the JDBC Sink connector (pushing the data back out to an identical table in MSSQL) and the Snowflake Sink Connector (doing the same, but into a Snowflake table).

The JDBC sink handles upserts just fine: if the source table contains 3 rows and I update one of them, the destination table contains exactly 3 rows with one updated. The Snowflake sink does not work this way; updates appear in the destination table as new rows (with the same key but different timestamps).

Q: How can I do a proper UPSERT into a Snowflake table? Or is this not supported?
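For what it's worth, the sink connector lands every Kafka record as a new row, so upsert semantics have to be built downstream, e.g. with a periodic MERGE that keeps only the latest row per key. A rough sketch, assuming a landing table `kafka_landing` with the connector's `RECORD_METADATA`/`RECORD_CONTENT` columns, a target table `target`, and a business key `id` (all names hypothetical):

```sql
-- Hypothetical names: kafka_landing is the table the sink connector writes
-- to, target is the deduplicated copy, and "id" is the assumed business key.
MERGE INTO target t
USING (
    -- keep only the newest landed record per key,
    -- using the Kafka offset as the version
    SELECT
        record_content:id::NUMBER      AS id,
        record_content:name::STRING    AS name
    FROM kafka_landing
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY record_content:id
        ORDER BY record_metadata:offset::NUMBER DESC
    ) = 1
) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
```

Scheduling something like this via a task gives you eventual upsert behavior, at the cost of the compute the MERGE consumes.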

miramar-labs commented 1 year ago

I have also tried creating a stream on the raw VARIANT Snowflake table, with the idea that I could use a task to periodically run a MERGE into a target table. But no matter whether I INSERT or UPDATE rows in the source database, the stream always shows the changes as INSERTs, so the merge won't work in that case.
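That behavior is expected: a stream reports changes to the table it is defined on, and the connector only ever INSERTs into the landing table (it never updates existing rows), so every source-side update surfaces as a new INSERT in the stream. The MERGE therefore has to treat each streamed row as a potential upsert rather than relying on the stream's change type. A hedged sketch with hypothetical names (`landing_stream`, `target`, `my_wh`, key `id`):

```sql
-- landing_stream is a stream on the connector's landing table; every row in
-- it is an INSERT, so the MERGE itself decides between update and insert.
CREATE OR REPLACE TASK apply_upserts
  WAREHOUSE = my_wh
  SCHEDULE  = '5 MINUTE'
AS
MERGE INTO target t
USING (
    SELECT record_content:id::NUMBER AS id,
           record_content            AS payload
    FROM landing_stream
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY record_content:id
        ORDER BY record_metadata:offset::NUMBER DESC
    ) = 1
) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.payload = s.payload
WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);
```

Consuming the stream in a DML statement like this also advances its offset, so each task run only sees rows landed since the previous run.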

lyuboxa commented 8 months ago

Any plans to support update/upsert functionality in the near future? I'm curious how that would play with Snowpipe Streaming.

airbots commented 1 month ago

Could this be resolved with a delete followed by an insert?

lyuboxa commented 1 month ago

One of the unanswered questions is around cost effectiveness. A MERGE requires compute time, but so does Snowpipe processing. So which approach would be the more cost-effective way to get upserts:

  1. Stage files, import them via Snowpipe into a staging table, then merge the staging table into the live table
  2. Stage files, perform the merge from the staged files directly into the live table
  3. Another option?

Additionally, if order is to be preserved between create/update events on the same record instance (say, merging over an id), there is a specific order in which these merges should happen, so more of a stack than a queue.
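One way to sidestep the ordering problem entirely, regardless of which staging option is chosen, is to carry the Kafka offset into the target as a version column and guard the update with it, so a late-arriving merge can never overwrite fresher data. A sketch with hypothetical names (`target`, `staged`, key `id`):

```sql
-- Order-safe upsert sketch: kafka_offset acts as a version; an update is
-- applied only when the incoming row is newer than what is already stored,
-- so merges can run in any order without clobbering fresher data.
MERGE INTO target t
USING staged s
ON t.id = s.id
WHEN MATCHED AND s.kafka_offset > t.kafka_offset THEN
  UPDATE SET t.payload      = s.payload,
             t.kafka_offset = s.kafka_offset
WHEN NOT MATCHED THEN
  INSERT (id, payload, kafka_offset)
  VALUES (s.id, s.payload, s.kafka_offset);
```

With a guard like this, strict processing order stops mattering for correctness, and the cost question reduces to how often and over how much data the MERGE runs.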