Closed kylebarron closed 1 month ago
I made a few changes and now I think this should be ready for review as long as we're ok materializing the full Arrow input stream. I think it may be better to not materialize the full Arrow input (especially because DuckDB is capable of larger-than-memory datasets) but that would require creating a DuckDB table from the data input.
I agree. Do you think we could change the behavior in a follow up PR or would that be something we want to handle here? The main thing we need to do is give a table_name for the front end to dispatch queries against.
We could create a DuckDB PyRelation from the Arrow stream. For example, this API is already supported:
import duckdb
import quak

conn = duckdb.connect(":memory:")
conn.sql("CREATE VIEW df AS SELECT * FROM ???")  # consume arrow c stream
widget = quak.Widget(conn, table="df")
I think it's fine to do in a follow up PR; up to you
Ok, let's just merge for now!
The Arrow PyCapsule Interface defines a way for Python libraries to exchange Arrow data at the binary level without needing to know library-specific APIs of producer or consumer. If an object exposes the __arrow_c_stream__ dunder method, then you can pass that into another Arrow library's constructor, which will call that method to get the PyCapsule object representing a pointer to an Arrow C stream. I've been working to promote its adoption throughout the Python Arrow ecosystem.
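For readers unfamiliar with the protocol, here is a rough sketch of the consumer-side check described above, using only the standard library. The names supports_arrow_c_stream, FakeProducer, and describe_input are hypothetical and are not part of quak, pyarrow, or any other library. A real consumer would go on to call the method and import the resulting capsule through an Arrow implementation.

```python
from typing import Any, Optional


def supports_arrow_c_stream(obj: Any) -> bool:
    """Return True if obj advertises the Arrow PyCapsule stream export method."""
    return callable(getattr(obj, "__arrow_c_stream__", None))


class FakeProducer:
    """Hypothetical stand-in for a producer such as a table or DataFrame."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        # A real producer returns a PyCapsule named "arrow_array_stream" that
        # wraps a pointer to an ArrowArrayStream struct. Building one needs an
        # Arrow implementation, so this stand-in only marks where it would go.
        raise NotImplementedError("stand-in only; no Arrow data behind it")


def describe_input(obj: Any) -> str:
    """Hypothetical dispatch: choose an ingestion path based on the protocol."""
    if supports_arrow_c_stream(obj):
        return "arrow-stream"
    return "unsupported"
```

The point of the design is that the consumer never imports the producer's library. It only looks for the dunder method, so any object that defines it takes the same code path.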
Because the PyCapsule Interface makes it free to share data across libraries, I've also been working on arro3, a minimal alternative to pyarrow.
With this PR, any object that implements this interface will just work with quak:
My PyCapsule Interface PRs were just merged in Polars, so with the next release of Python Polars this will just work, without going through the DataFrame API. (If Polars, DuckDB, and quak all speak Arrow, why go through the DataFrame API?)
DuckDB hasn't implemented the interface itself, so for now I believe you have to go through pyarrow.