tekumara / fakesnow

Fake Snowflake Connector for Python. Run, mock and test Snowflake DB locally.
Apache License 2.0

Support for loading data from stages #35

Open tharwan opened 9 months ago

tharwan commented 9 months ago

Hi,

Would it be possible to support loading data from a stage? (https://docs.snowflake.com/en/user-guide/data-load-considerations-load)

I am not sure from the docs whether this would even be in scope for this project.

However, it would help us a lot to test our complete ETL workflow.

tekumara commented 9 months ago

Hi @tharwan, this might be possible ... could you share example SQL statements you'd like support for, both creating the stage and loading from it?

tharwan commented 9 months ago

Loading from a stage looks something like this:

COPY INTO my_table
FROM @my_stage/my_file
FILE_FORMAT = (FIELD_DELIMITER = ';')

Creating a stage is not so relevant in our case but looks like this:

CREATE STAGE my_ext_stage
URL='azure://myaccount.blob.core.windows.net/load/files/'
STORAGE_INTEGRATION = myint;

maybe related: https://github.com/tobymao/sqlglot/issues/2463

tekumara commented 9 months ago

Thanks! From the above I gather you want support for CSV files.

I think creating the stage would be needed for fakesnow to know where to find the files to load.

tharwan commented 9 months ago

Another option could be to simply look for the file locally and ignore the @my_stage part.

e.g.

COPY INTO my_table
FROM @my_stage/my_file
FILE_FORMAT = (FIELD_DELIMITER = ';')

would translate to

COPY my_table
FROM 'my_file'
(FORMAT CSV, DELIMITER ';');
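
For illustration, a naive version of that translation could be a string rewrite applied before handing the statement to duckdb. This is only a hypothetical sketch (the regex, table, and file names are made up, and it is not fakesnow's actual implementation):

import re

import duckdb

def rewrite_copy(sql: str) -> str:
    # Match the narrow shape "COPY INTO <table> FROM @<stage>/<file>
    # FILE_FORMAT = (FIELD_DELIMITER = '<d>')" and ignore the stage entirely,
    # treating <file> as a local path.
    m = re.match(
        r"COPY INTO (\w+)\s+FROM @\w+/(\S+)\s+FILE_FORMAT\s*=\s*\(FIELD_DELIMITER\s*=\s*'(.)'\)",
        sql,
        re.IGNORECASE,
    )
    table, filename, delimiter = m.groups()
    return f"COPY {table} FROM '{filename}' (FORMAT CSV, DELIMITER '{delimiter}')"

conn = duckdb.connect()
conn.execute("CREATE TABLE my_table (a INTEGER, b VARCHAR)")
with open("my_file", "w") as f:
    f.write("1;hello\n2;world\n")
conn.execute(rewrite_copy("COPY INTO my_table FROM @my_stage/my_file FILE_FORMAT = (FIELD_DELIMITER = ';')"))
print(conn.execute("SELECT * FROM my_table").fetchall())  # [(1, 'hello'), (2, 'world')]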

DanCardin commented 3 months ago

We recently found ourselves wanting this at $job, so I may look into a relatively simple initial implementation somewhat soon. My personal plan/preference would be for fakesnow to simply track stage creation internally in information_schema.stages, and something similar for integrations (I don't see an information_schema table for those, but we ourselves use integrations).

fakesnow would then translate COPY statements against the service in question into boto3 calls plus INSERT statements (which ought to work for all the supported cloud providers, IIRC).
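
As a rough sketch of that idea (the bucket, key, and table names below are hypothetical, and this is not an existing fakesnow API - it would run against moto or a real bucket):

import boto3
import duckdb

# Resolve the stage to its storage location. fakesnow would look this up in
# whatever it records at CREATE STAGE / CREATE STORAGE INTEGRATION time;
# it is hardcoded here for illustration.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="load-bucket", Key="files/my_file")
local_path = "/tmp/my_file"
with open(local_path, "wb") as f:
    f.write(obj["Body"].read())

# The COPY then becomes a plain insert of the downloaded file's rows.
conn = duckdb.connect()
conn.execute("CREATE TABLE my_table (a INTEGER, b VARCHAR)")
conn.execute(f"INSERT INTO my_table SELECT * FROM read_csv('{local_path}', delim=';', header=false)")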

It would then be the user's job to set up moto however they prefer so that no calls reach the actual service. Perhaps fakesnow could provide a default fixture that avoids real calls by default, but I suspect we, at least, would be forced to override it anyway.
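
A default fixture along those lines might look like this (a sketch assuming moto >= 5, where mock_aws replaced the older per-service decorators; the bucket and key are made up):

import boto3
import pytest
from moto import mock_aws

@pytest.fixture
def fake_s3():
    # Everything inside this context is intercepted by moto, so no boto3
    # call reaches the real AWS service.
    with mock_aws():
        s3 = boto3.client("s3", region_name="us-east-1")
        s3.create_bucket(Bucket="load-bucket")
        s3.put_object(Bucket="load-bucket", Key="files/my_file", Body=b"1;hello\n")
        yield s3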


Like I said, I'll probably look into the feasibility of this strategy soon, but I expect sqlglot to be the short-term limiting factor in minimally parsing the integration, stage, and COPY statement syntaxes.

tekumara commented 2 months ago

If the use case were just testing, then for the first iteration of this I thought we could just support a local file via a file:// URL. That would avoid the need to set up actual S3/Azure storage, or moto to mock S3. Would that work?
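
Something like this, say - a sketch only, with a made-up stage registry to show the shape of the idea:

import pathlib

import duckdb

# Pretend CREATE STAGE recorded this mapping of stage name to file:// URL.
stages = {"my_stage": "file:///tmp/my_stage"}

pathlib.Path("/tmp/my_stage").mkdir(exist_ok=True)
pathlib.Path("/tmp/my_stage/my_file").write_text("1;hello\n")

# COPY INTO my_table FROM @my_stage/my_file resolves to a local path that
# duckdb can read directly.
path = stages["my_stage"].removeprefix("file://") + "/my_file"
conn = duckdb.connect()
conn.execute("CREATE TABLE my_table (a INTEGER, b VARCHAR)")
conn.execute(f"COPY my_table FROM '{path}' (FORMAT CSV, DELIMITER ';')")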

DanCardin commented 2 months ago

My personal use case requires a storage integration/S3, but presumably implementing that will be easier if there is already general support for staged file:// URLs.

tekumara commented 2 months ago

Yes, good point - if we transform a COPY INTO into an INSERT INTO ... SELECT in duckdb, then we could support both file:// and s3:// URLs via the same mechanism relatively easily, using duckdb's support for S3 via the httpfs extension.
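
A sketch of what that transform could produce (the s3:// URL is illustrative, and reading a real bucket would need credentials configured for httpfs):

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs")  # downloaded once; only needed for s3:// URLs
conn.execute("LOAD httpfs")
conn.execute("CREATE TABLE my_table (a INTEGER, b VARCHAR)")

# The same rewrite serves both schemes: read_csv takes a local path (which a
# file:// URL can be stripped down to) or, with httpfs loaded, an s3:// URL.
url = "s3://load-bucket/files/my_file"
conn.execute(f"INSERT INTO my_table SELECT * FROM read_csv('{url}', delim=';', header=false)")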

If you're keen to tackle this, a PR would be welcome!