manzt / quak

a scalable data profiler
https://manzt.github.io/quak/
MIT License

feat: Support for Arrow PyCapsule interface #23

Closed: kylebarron closed this 1 month ago

kylebarron commented 1 month ago

The Arrow PyCapsule Interface defines a way for Python libraries to exchange Arrow data at the binary level without needing to know the library-specific APIs of the producer or consumer. If an object exposes the __arrow_c_stream__ dunder method, you can pass it into another Arrow library's constructor, which will call that method to obtain a PyCapsule wrapping a pointer to an Arrow C stream.

I've been working to promote its adoption throughout the Python Arrow ecosystem.

Because the PyCapsule Interface makes it free to share data across libraries, I've also been working on arro3, a minimal alternative to pyarrow.

With this PR, any object that implements this interface will just work with quak:

[screenshot]

Since my PyCapsule interface PRs were just merged in Polars, this will just work with the next release of Python Polars, without going through the DataFrame API (if Polars, DuckDB, and quak all speak Arrow, why go through the DataFrame API?).

DuckDB hasn't implemented the interface itself, so for now I believe you have to go through pyarrow.

kylebarron commented 1 month ago

I made a few changes, and I think this should now be ready for review as long as we're OK with materializing the full Arrow input stream. I think it may be better to not materialize the full Arrow input (especially because DuckDB is capable of larger-than-memory datasets) but that would require creating a DuckDB table from the data input.

manzt commented 1 month ago

I think it may be better to not materialize the full Arrow input (especially because DuckDB is capable of larger-than-memory datasets) but that would require creating a DuckDB table from the data input.

I agree. Do you think we could change the behavior in a follow-up PR, or is that something we want to handle here? The main thing we need to do is provide a table_name for the front end to dispatch queries against.

We could create a DuckDB PyRelation from the Arrow stream; for example, this API is already supported:

conn = duckdb.connect(":memory:")
conn.sql("CREATE VIEW df AS SELECT * FROM ???") # consume arrow c stream
widget = quak.Widget(conn, table="df")

kylebarron commented 1 month ago

I think it's fine to do in a follow-up PR; up to you.

manzt commented 1 month ago

Ok, let's just merge for now!

manzt commented 1 month ago

https://github.com/manzt/quak/releases/tag/v0.1.6