scylladb / scylla-migrator

Migrate data using Spark to Scylla, typically from Cassandra or Parquet files, or alternatively from DynamoDB to Scylla Alternator.
https://migrator.docs.scylladb.com/stable/
Apache License 2.0

[Question] Possible to use spark-snowflake connector as data sink? #25

Open kharmabum opened 3 years ago

kharmabum commented 3 years ago

My team is exploring the use of this tool for reconstructing several tables that are live in production that need to be recreated with new partitioning strategies (and data migrated over).

If we were also able to leverage this tool to bulk load data into Snowflake, it would make it significantly more appealing to us.

Could anyone provide a bit of insight into the level of effort/cost that would be associated with adapting this tool to accept the spark-snowflake connector (https://github.com/snowflakedb/spark-snowflake) as a sink?

Apologies if the terminology here is a bit scrambled; I'm still getting up to speed on Spark.

dorlaor commented 3 years ago

Since the tool already supports CQL and DynamoDB, it shouldn't be too complicated to add another source/target. For an expert, it's a couple of days of work. Patches are more than welcome.


tarzanek commented 3 years ago

@kharmabum this boils down to following the same pattern as the writers in https://github.com/scylladb/scylla-migrator/tree/master/src/main/scala/com/scylladb/migrator/writers: you would just add a writer and implement writeDataframe. Since https://github.com/snowflakedb/spark-snowflake is a datasource, this should be straightforward, as Dor correctly pointed out.
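To make the suggestion above concrete, here is a minimal sketch of what such a writer could look like. It is not actual project code: the object name, the config parameters, and the method shape are illustrative assumptions modeled on the existing writers, while the `net.snowflake.spark.snowflake` source name and the `sfURL`/`sfUser`/`sfPassword`/`sfDatabase`/`sfSchema`/`dbtable` options come from the spark-snowflake connector. Running it requires a live Spark session and Snowflake credentials.

```scala
// Hypothetical Snowflake writer sketch for scylla-migrator.
// Assumes the spark-snowflake connector is on the classpath.
package com.scylladb.migrator.writers

import org.apache.spark.sql.{ DataFrame, SaveMode }

object SnowflakeWriter {

  // spark-snowflake registers itself under this data source name.
  private val SnowflakeSourceName = "net.snowflake.spark.snowflake"

  // Parameter names here are illustrative; the real migrator writers
  // take their settings from the migrator's YAML-backed config classes.
  def writeDataframe(
      df: DataFrame,
      sfUrl: String,      // e.g. "<account>.snowflakecomputing.com"
      sfUser: String,
      sfPassword: String,
      sfDatabase: String,
      sfSchema: String,
      targetTable: String
  ): Unit = {
    val sfOptions = Map(
      "sfURL"      -> sfUrl,
      "sfUser"     -> sfUser,
      "sfPassword" -> sfPassword,
      "sfDatabase" -> sfDatabase,
      "sfSchema"   -> sfSchema
    )

    // Append the migrated rows into the target Snowflake table.
    df.write
      .format(SnowflakeSourceName)
      .options(sfOptions)
      .option("dbtable", targetTable)
      .mode(SaveMode.Append)
      .save()
  }
}
```

Because the connector is a standard Spark datasource, the writer reduces to configuring `DataFrameWriter` options; the remaining work would be wiring a Snowflake target type into the migrator's config parsing and dispatch.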

kharmabum commented 3 years ago

@tarzanek @dorlaor thanks for the thoughtful responses. I think I might be able to take this on in a few months. It depends on when we update to 4.0 (and thus get access to CDC, which would enable this pipeline for us). Thanks again.