kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Snowflake Data Connectors (SnowPark) #108

Closed yetudada closed 1 year ago

yetudada commented 1 year ago

Description

I think there's scope to create a series of data connectors that would allow Kedro users to connect to Snowflake in different ways. This usage pattern was identified in the kedro-org/kedro#1653 research, which found that our users sometimes want to leverage SQL-based workflows for their data engineering pipelines. These connectors essentially confine the use of Python to the parts of the workflow that need it, such as data science.

While I have created this issue, I think it's important to document why we have seen users create custom Snowflake datasets rather than leverage our existing pandas.SQLTableDataSet and pandas.SQLQueryDataSet for the same purpose, especially for Pandas-based workflows.
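For context, the existing Pandas datasets can already reach Snowflake through a SQLAlchemy connection string (the snowflake-sqlalchemy dialect). A minimal sketch of building such a URL, with placeholder account and credential values that are not from this thread:

```python
# Sketch: a snowflake:// SQLAlchemy URL of the shape used by the
# snowflake-sqlalchemy dialect (format assumed; values are placeholders).
from urllib.parse import quote


def snowflake_sqlalchemy_url(user, password, account, database, schema, warehouse):
    """Build a snowflake:// SQLAlchemy connection URL (assumed format)."""
    return (
        f"snowflake://{quote(user)}:{quote(password)}@{account}"
        f"/{database}/{schema}?warehouse={warehouse}"
    )


# Such a URL would go under `credentials: con` for a pandas.SQLQueryDataSet
# or pandas.SQLTableDataSet catalog entry.
url = snowflake_sqlalchemy_url("alice", "s3cret", "myacct", "ANALYTICS", "PUBLIC", "WH_XS")
print(url)  # → snowflake://alice:s3cret@myacct/ANALYTICS/PUBLIC?warehouse=WH_XS
```

This works for query/table-shaped access, but everything is pulled into Pandas eagerly, which is part of why users have built dedicated Snowflake datasets instead.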

Possible Implementation

This task proposes building out the following data connectors:

datajoely commented 1 year ago

So I've been wanting to work on this for a while - in my opinion the right way to approach this is to use the new Snowpark API, which is almost identical to the Spark DataFrame API (context hook, DataFrame class).

https://docs.snowflake.com/en/developer-guide/snowpark/python/index.html

Vladimir-Filimonov commented 1 year ago

Implementing via Snowpark is tempting, but I would be cautious about using Snowpark Python for production workloads as it is still in preview.

Vladimir-Filimonov commented 1 year ago

Nevermind, snowpark python 1.0 just arrived: https://pypi.org/project/snowflake-snowpark-python/

datajoely commented 1 year ago

Excellent news!

I'm a big proponent of going the Snowpark route - to do this right I think we need to do 3 things, in the following sequence, as separate PRs.

  1. We need to implement SnowParkDataSet; this should mostly be a copy-paste of SparkDataSet. It will be a much simpler implementation, as we don't need to worry about file paths, just table schemas and names.

  2. We need to introduce a Kedro starter that works the same way as the pyspark one we have today. This will give users a ready-to-go Snowflake example via `kedro new --starter=snowflake`. We would also need to add it to the default starter scope here.

  3. We would then need some lightweight docs explaining how to use it!

  4. (Extra credit) ensure transcoding to pandas works okay :)
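A rough sketch of what step 1 could look like: a dataset keyed on table coordinates rather than file paths. The real implementation would subclass `kedro.io.AbstractDataSet`; this standalone stand-in (all names are illustrative, not the final API) only shows the shape.

```python
# Hedged sketch of a Snowpark table dataset shaped like SparkDataSet, but
# identified by database/schema/table rather than a filepath. In Kedro this
# would extend AbstractDataSet; here it is a plain class for illustration.
class SnowparkTableDataSet:
    def __init__(self, table_name, database, schema, session=None):
        self._table_name = table_name
        self._database = database
        self._schema = schema
        self._session = session  # a snowflake.snowpark.Session in practice

    def _describe(self):
        # Identity is table coordinates, not a filepath as in SparkDataSet.
        return {
            "table_name": self._table_name,
            "database": self._database,
            "schema": self._schema,
        }

    def _load(self):
        # session.table() returns a lazy Snowpark DataFrame.
        return self._session.table(
            f"{self._database}.{self._schema}.{self._table_name}"
        )

    def _save(self, data):
        # Snowpark DataFrames expose a writer similar to Spark's.
        data.write.mode("overwrite").save_as_table(
            f"{self._database}.{self._schema}.{self._table_name}"
        )
```

Point 4 (transcoding to pandas) would then mostly come down to calling Snowpark's `DataFrame.to_pandas()` on load.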

datajoely commented 1 year ago

I hope to eventually get to this, but any help would be greatly appreciated!

marrrcin commented 1 year ago

@datajoely We (GetInData) have recently researched the topic of Snowpark + Kedro extensively and we will be happy to take over this feature and help to implement it. Let's discuss the details 🙂

datajoely commented 1 year ago

@marrrcin that's awesome - I connected with @Vladimir-Filimonov and @heber-urdaneta last week who have started work on the prototype. Are you able to raise your PRs even if they're in draft and we can discuss next steps together?

heber-urdaneta commented 1 year ago

@datajoely sure, just created the draft PR (still in progress), we can discuss further

datajoely commented 1 year ago

Awesome - I've done a mini review and it's in a really good place.

datajoely commented 1 year ago

I'm still not sure how to write integration tests for this without an actual snowflake instance spun up.

Looking at some examples in the Snowflake Labs repo perhaps we can do it like this: https://github.com/Snowflake-Labs/snowpark-devops/blob/main/tests/procedure_test.py

Vladimir-Filimonov commented 1 year ago

> I'm still not sure how to write integration tests for this without an actual snowflake instance spun up.
>
> Looking at some examples in the Snowflake Labs repo perhaps we can do it like this: https://github.com/Snowflake-Labs/snowpark-devops/blob/main/tests/procedure_test.py

We reached out to Snowflake folks asking if we can expect any mocks provided as part of Snowpark to make unit tests easier, but it doesn't seem anything like that can be expected in the near term. So I'm afraid that to cover methods like _save with unit tests, we will have to use a real Snowflake instance. Let me push a proposal of what tests can be done locally vs what requires real SF early next week.
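The local/remote split could look like this: the call-sequence logic gets unit tested by substituting a mock for the Snowpark session and DataFrame, leaving only wire-level behaviour for integration tests against a real instance. A minimal sketch (the function and method names here are illustrative assumptions, not the dataset's actual API):

```python
# Sketch: unit testing Snowpark save logic without a Snowflake instance, by
# mocking the session/DataFrame and asserting on the calls made.
from unittest.mock import MagicMock


def save_to_table(session, df, qualified_name):
    """Write a (Snowpark) DataFrame to a fully qualified table (assumed API)."""
    df.write.mode("overwrite").save_as_table(qualified_name)


# No Snowflake needed: MagicMock stands in for the Snowpark objects.
session = MagicMock()
df = MagicMock()
save_to_table(session, df, "DB.SCHEMA.MY_TABLE")
df.write.mode.assert_called_once_with("overwrite")
df.write.mode.return_value.save_as_table.assert_called_once_with("DB.SCHEMA.MY_TABLE")
```

Anything asserting on actual table contents or Snowflake-side behaviour would still need the real instance.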

heber-urdaneta commented 1 year ago

@datajoely @marrrcin thanks for your comments on the PR! @Vladimir-Filimonov took a shot at addressing the feedback, just created a new PR: https://github.com/kedro-org/kedro/pull/2032

merelcht commented 1 year ago

A Snowpark dataset was added in https://github.com/kedro-org/kedro-plugins/pull/104 and released in kedro-datasets 1.1.0.