joshuataylor closed this issue 1 year ago.
Hi @joshuataylor! We actually discussed Snowflake and did early prototypes (aiming for the Arrow version); we made some progress but never fully concluded it.
In any case, we agree that getting started with the JSON one is good enough! However, we were really planning to go the req_snowflake route. SnowflakeEx works great for Ecto projects, but I believe the req_snowflake route will be considerably leaner, which is important to reduce the notebook footprint. What do you think?

You may actually even be able to go the other way and have SnowflakeEx use req_snowflake. Is this a route you would be willing to explore?
Thank you! :heart:
Yes, I'd be happy to create a version for req_snowflake, as we would use it.
Initially this would use JSON, but we can work on getting it to work for Arrow.
I've worked quite a bit on reverse engineering the Snowflake protocol, so happy to champion that.
@joshuataylor let's create a separate issue then for Snowflake Arrow. My understanding was that they dump the results to S3 (or S3-like storage). This is a similar design to AWS Athena, so we can also wait until req_athena is done before we focus on Arrow support. For parsing the results, we can use Explorer (as an optional dependency).
@aleDsz, can you please get started with the issue and dump your findings?
Yeah, it's in a streaming format for Arrow. I also wrote a connector in Rust. You download the parts as you need them, but that's unrelated to the format.
So I've started working on req_snowflake, and should have a repository for review and comments in the next day or so.

I haven't used BigQuery, so I'm not sure how that works, but req_bigquery looks great.
Snowflake is considered a "Data Warehouse", but it can be used almost exactly the same way Postgres is used, which is why snowflake_elixir was pretty straightforward to write once I had figured out how the Snowflake REST API worked, and it tied in with db_connection pretty easily. We've been using this for 18 months for our internal apps at work and it's been working quite well. Porting to req_snowflake was pretty straightforward once I understood how req works with request/response steps. I really like it.
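For anyone following along, this is roughly the plugin shape I mean; the module and step names below are hypothetical placeholders, not the actual req_snowflake code:

```elixir
defmodule MyReqPlugin do
  # Hypothetical plugin skeleton showing Req's request/response steps.
  def attach(%Req.Request{} = request, options \\ []) do
    request
    |> Req.Request.register_options([:snowflake_query])
    |> Req.Request.merge_options(options)
    |> Req.Request.append_request_steps(snowflake_run: &run/1)
    |> Req.Request.append_response_steps(snowflake_decode: &decode/1)
  end

  # Request step: receives and returns a %Req.Request{} (build URL/headers/body here).
  defp run(request), do: request

  # Response step: receives and returns a {request, response} tuple (decode the body here).
  defp decode({request, response}), do: {request, response}
end
```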
I have spent a considerable amount of time spelunking and figuring out how Snowflake's REST API internals work (Arrow, JSON responses, header quirks, etc.), so I'm also happy to chat about what I've learned about their implementation, as it's a bit wonky in some places but overall pretty good. Their Python/Golang/NodeJS connectors are pretty good now as well, and you can read through their code to figure out how they do things. I think we'll copy the way they do Snowflake Stages as well.
Also, as a random note: about 2 years ago I worked on a query editor style app, similar to Heroku Dataclips, which was built for Snowflake, but I gave up on it as some of the technology wasn't there yet. But Livebook/Table is exactly what I've been wanting, and it's very polished. So I plan on open-sourcing this work and porting it to work in Livebook. I reckon a proper query editor, with autocompletion on par with DataGrip (the best SQL editor I've used, hands down), would be a killer feature for Livebook.
Perfect, we would be very happy to discuss and explore those ideas with you!
I have created a PR here: https://github.com/joshuataylor/req_snowflake/pull/6
Please give feedback if you want; feedback from anyone else reading this would also be appreciated.
I am happy to also eventually move this to a generic repo of req plugins if needed.
Maybe we should reopen this ticket when this repo is ready?
Amazing! @wojtekmach is definitely the best person to review it. Also, I think we can keep this issue open too!
Btw, would you be ok with transferring the repo to the livebook org before merging the kino_db integration? Of course you would retain your access, but we also want to have access over all integrations that can be started with a single click from the UI. :)
Thank you! <3
Yep, pinged him on Slack about this as well, as I wanted to reach out about some req questions.

Also yep -- I'm happy to move it when required; I'm just not sure of the process.
@joshuataylor the easiest way I know is to transfer it to me, then I transfer to the org. But let's not worry about it right now :D
@joshuataylor Hi, it's awesome to see that you already know how the Snowflake REST API works; that's what I was trying to figure out in order to receive the Arrow response format.
I've read the Python connector repo to see how to impersonate the Elixir version as a "Python version" to make it return the response in the ARROW format, but I didn't make much progress with that.
Looking at your current PR for req_snowflake, I'm very excited to see how we could support the ARROW format, but I'm happy to see your initial work with JSON.

As you need to use S3, what do you think about using req_s3?
If you set the application name to mimic other clients (Python/Java/etc.), it will sometimes give you Arrow and other times JSON. It seemed (from 2 years ago, so this could have changed) that using any client name and version apart from JavaScript's returned Arrow.
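To illustrate what I mean by "setting the application": the connectors identify themselves in the login request body, roughly like this (the field names are from my reading of the official connectors, and the API itself is undocumented, so treat the exact shape as an assumption):

```elixir
# Illustrative only: a rough sketch of the client-identification fields the
# official connectors send when logging in. Values here are placeholders.
login_body = %{
  "data" => %{
    "CLIENT_APP_ID" => "PythonConnector",
    "CLIENT_APP_VERSION" => "2.7.0",
    "LOGIN_NAME" => "my_user",
    "PASSWORD" => "my_password"
  }
}
```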
I ended up converting Arrow using a NIF to get it to work, but found JSON easier to manage (at the time). I'm happy to get this working with Arrow; now that Rustler has precompiled support, I'll get CI to build for every platform (including manually building for Mac M1 & ARM if GitHub doesn't support it yet).
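For the precompiled route, the setup would look roughly like this (the module name, crate name, and release URL below are placeholders, not the final project layout):

```elixir
defmodule ReqSnowflake.Native do
  # Hypothetical NIF module using rustler_precompiled; downloads a prebuilt
  # library from the release URL instead of compiling Rust on the user's machine.
  use RustlerPrecompiled,
    otp_app: :req_snowflake,
    crate: "req_snowflake_arrow",
    base_url: "https://github.com/joshuataylor/req_snowflake/releases/download/v0.1.0",
    version: "0.1.0"

  # NIF stub; replaced at load time by the precompiled library.
  def convert_arrow_stream(_binary, _cast), do: :erlang.nif_error(:nif_not_loaded)
end
```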
Happy to try req_s3, though I think we might need to support the header here for it to get the response back in the correct format. I'll try removing it and see if it breaks.
I wouldn't add req_s3 as a dependency to your plugin. It's only really meant for end-users to easily explore stuff in S3. It's pretty straightforward too, if you find any bits useful, I'd copy them instead.
Hi @joshuataylor! This is just a quick ping to check if there is anything we can help with.
Regarding Arrow support, note that the Explorer library can read Arrow IPC. At the moment the IPC needs to be stored on disk, but we can easily add a function to read it from an Elixir binary. If the read_ipc from disk approach works, then I would actually go with Explorer, because you most likely want a dataframe later to explore the data anyway! Also note Explorer supports only some types (:integer | :float | :boolean | :string | :date | :datetime), but we can also add new data types if so required.
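A rough sketch of the disk-based approach, assuming the Arrow IPC payload is already in memory (the function name and options may differ between Explorer versions):

```elixir
# Sketch only: assumes `ipc_binary` holds the Arrow IPC bytes fetched from Snowflake.
path = Path.join(System.tmp_dir!(), "snowflake_result.arrow")
File.write!(path, ipc_binary)

# Read the IPC file from disk into an Explorer dataframe.
{:ok, df} = Explorer.DataFrame.read_ipc(path)
```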
Hi!
Yep, I've just been going down the rabbit hole these past 2 weeks playing with Arrow2, Rustler, NIFs, and converting to binaries. I'm currently wrapping up some benchmarks, but we can convert everything from Rust to Elixir terms extremely quickly. The main blocker I had was figuring out how binaries and sub-binaries work; major thanks to @filmor for pointing to sub-binaries, which has massively decreased times.
I'll have something out in the next few days, but I've found that we're much faster than the JSON implementation, and I've also improved the JSON implementation so the user has a choice to either:
a/ Use JSON with Jason to decode
b/ Use Arrow
I'll also play with Explorer to see how it compares to doing things in Elixir vs not.
I'm also curious how we can quickly convert from columns -> rows so the rows are in lists, see here: https://github.com/joshuataylor/list_of_lists
We get the following back from Arrow:
[
[1,2,3,4],
["a","b","c","d"]
]
Then I do:
data
|> Native.convert_arrow_stream(Keyword.get(opts, :cast, false))
|> Enum.zip_with(& &1)
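For reference, on the example columns above that zip produces row lists like this (plain Enum.zip_with/2, nothing Snowflake-specific):

```elixir
columns = [
  [1, 2, 3, 4],
  ["a", "b", "c", "d"]
]

# Zipping the columns together turns column-major data into row-major lists.
Enum.zip_with(columns, & &1)
#=> [[1, "a"], [2, "b"], [3, "c"], [4, "d"]]
```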
edit: So Arrow has two different formats for IPC:
1/ IPC files
2/ Arrow streaming files
We'll need to add 2/ to polars, which is noted here: https://github.com/pola-rs/nodejs-polars/issues/109
That's awesome! Keep in mind that sub-binaries have the downside that the underlying binary won't be garbage collected, but that should be alright given Arrow's design.
The cool thing about Explorer is that, as we improve its APIs, you will be able to build large queries that filter, group_by, and order, and execute them only once over the Arrow format in Rust. The conversion to Elixir happens only at the end (as with dataframes in Python).
The other thing about Explorer is that we will likely implement an ExplorerSQL project, so you can write your SQL queries in Explorer too, translate them to SQL, send them to Snowflake (likely using req_snowflake), and continue processing the results with Explorer locally. So if you want to go with Arrow out of the box, it is likely the best path forward.
In any case, it is great that you are deep into the Arrow/Rust/NIF side of things, because it is becoming an essential part of the data aspect of Livebook/Nx, and the knowledge will definitely help!
Perfect, that is the exact use case I also wanted for a tool. Sounds very similar to the way count.co does things, they have a notebook which converts each cell into actual SQL which is executed against Snowflake.
If there is still interest in supporting Snowflake via the JSON API then this library might be useful. It is built on top of Req. https://github.com/HGInsights/avalanche
Apologies for the lack of updates on this, life ended up being pretty crazy this year with personal family stuff going on.
Snowflake's API is incredibly undocumented, but it is stable and never really changes, as they have large customers who can't have the API change. I'm happy to chat about the API and any nuances, especially regarding JSON vs Arrow rows, etc., with anyone -- feel free to email me at joshuataylorx [at] gmail [dot] com and we can either discuss via email or catch up over video.
For what it's worth, we now have Elixir+Snowflake+Arrow support directly in ADBC. Our goal is to integrate it into this project via Explorer+ADBC.
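For anyone who wants to try it, a minimal sketch of the ADBC route (the exact Snowflake URI and options below are placeholders; check the adbc docs for the real connection settings):

```elixir
# Start the Snowflake ADBC driver, open a connection, and run a query.
{:ok, db} = Adbc.Database.start_link(driver: :snowflake, uri: "user:password@account/database")
{:ok, conn} = Adbc.Connection.start_link(database: db)

{:ok, results} = Adbc.Connection.query(conn, "SELECT 1 AS one")
```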
Oh, that's great. I will give ADBC a try.
This is in!
Hi!
At the company I work for we use Snowflake. It's pretty good as a data warehouse. It uses JSON as a response format, and also Arrow for supported languages (as Erlang/Elixir doesn't have a library for this, we would have to use a Rust NIF to get it to work, and the current JSON responses work "good enough" for us). See the blog post about Arrow.
I also love Elixir, and so does the company I work for.
I have written a Snowflake library for Elixir, and also an Ecto integration, both of which we use to drive internal apps. Works pretty well, however I want to clean up the library a bit. I'm looking over req_bigquery now as well.
I'm looking now into adding Snowflake support, and will support it via a feature branch on my fork until I've got the public snowflake_elixir libraries to a standard I would be comfortable with.
TLDR:
1/ Create a generic snowflake_elixir library, similar to req_bigquery. Maybe req_snowflake ;)
2/ Make snowflake_elixir_dbconnection rely on this
3/ Make snowflake_elixir_ecto rely on snowflake_elixir_dbconnection

Thoughts?
edit: After an hour, got the table integration in and it works! I'll finish off the tests then create a PR.