Closed: rupurt closed this issue 7 months ago.
Nice idea, do you have a specific use case in mind? I wonder whether, as a first step, it would be sufficient to add a few methods and inherit, or whether arrow-odbc would have to switch to the Arrow stream interface altogether.
I'm trying to plug it into the duckdb arrow reader.
How is arrow-odbc helpful at all if you already have a reader emitting batches?
What I really want is for the reader to emit the Arrow batch interface so DuckDB can integrate with it directly. There are also a bunch of other Arrow APIs that only work with known Arrow types. I've currently wrapped them, but ideally I wouldn't need to.
The part that confuses me is that, following the link to the DuckDB Arrow reader interface, it seems to me that this is an API that emits record batches, whereas what I read in your description is that you want to consume record batches emitted by arrow-odbc. I would like to understand into which API call you want to put the RecordBatchReader that would be emitted by read_arrow_batches_from_odbc.
This is the documentation for the Arrow batch reader. DuckDB can consume it natively.
I frequently use this pattern in my ETLs, using DuckDB for joining cross-database sources and transformation logic:
import arrow_odbc
import duckdb
import pyarrow

# sql and connect_string are the ODBC query and connection string to read from.
query = arrow_odbc.read_arrow_batches_from_odbc(sql, connect_string)
reader = pyarrow.RecordBatchReader.from_batches(query.schema, query)
with duckdb.connect("temp_db.duckdb") as duck:
    duck.from_arrow(reader).create("my_table")
    duck.sql("select * from my_table").to_parquet("output.parquet")
(The example uses DuckDB's lazy Relational API.) An arrow-odbc function which returns a RecordBatchReader directly would be nice to have, rather than converting the BatchReader iterator with RecordBatchReader.from_batches as above.
Thanks @timrburnham, that's essentially how I'm currently wrapping it. Not too much effort, but it would be nice to skip that step.
@timrburnham thanks for providing the example; it answered a number of questions I had.
My thoughts so far: I have been aware of the Arrow C stream interface. It is just a bit younger than this library. Migrating to it would not be too big a change code-wise; it is reasonably close to the C interface which is already in place. The part that would take effort is actually the build system, which I suspect would get more complicated. The benefit would be fewer indirections when calling a method, compared to the wrapping you demonstrated. I am not sure that would ever make a difference performance-wise, though.
Also, you did not ask for performance improvements; you asked for better code ergonomics.
I am skeptical about directly returning a RecordBatchReader, as it is neither an abstract base class nor an implicit or explicit protocol. Instead it seems more like a type-erasure type, and I am unsure how I would extend its functionality with things specific to this ODBC reader. E.g. more_results currently allows jumping to the next result set from the database, in case the statement or stored procedure returns more than one result set.
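For context, a minimal sketch of how such ODBC-specific functionality gets used. The stored procedure name and connection string are made up, and the exact signature of more_results should be checked against the arrow-odbc documentation:

from arrow_odbc import read_arrow_batches_from_odbc

# Hypothetical stored procedure that returns more than one result set.
reader = read_arrow_batches_from_odbc(
    "EXEC dbo.multi_result_report",
    connection_string="DSN=my_dsn",
)

while True:
    for batch in reader:
        print(batch.num_rows)  # consume the current result set
    # Assumed behaviour: more_results advances to the next result set and
    # reports whether one exists.
    if not reader.more_results():
        break

A plain pyarrow RecordBatchReader has no place for a method like this, which is the extensibility concern above.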
Currently I think one would still need a way for the user to explicitly opt out of all the ODBC-specific functions and transition explicitly to the pyarrow RecordBatchReader. Something like a to_pyarrow_batch_reader. This function could be implemented with what you do now: pyarrow.RecordBatchReader.from_batches(query.schema, query).
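To make the idea concrete, a minimal sketch of the conversion as a free function; the name mirrors the to_pyarrow_batch_reader method floated above, which does not exist yet:

import pyarrow

def to_pyarrow_batch_reader(query) -> pyarrow.RecordBatchReader:
    # Explicitly leave the ODBC-specific surface (e.g. more_results) behind
    # and hand the remaining batches to a plain pyarrow RecordBatchReader.
    return pyarrow.RecordBatchReader.from_batches(query.schema, query)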
Would you consider this an improvement?
Best, Markus
Closing this issue. Some final notes on the design here, as a reminder to myself.
Inheriting directly from PyArrow's RecordBatchReader would imply using its C interface exclusively. It is basically a virtual function table plus a pointer to the object, and that table is fixed, so I cannot add C functions of my own that I want to support. This is fine to a degree, because I should not be able to extend the contract of a third-party library, but it also makes it hard to extend for my specialization of the contract.
None of this would actually be a problem if DuckDB stuck to the generic roots of Python. Ironically, DuckDB's Python API does not use duck typing. If it did, this could all be solved by choosing the same method names for this library and PyArrow. Instead, it seems to me that DuckDB explicitly checks for inheritance from that interface.
Here the two philosophies of inheriting to be reused and inheriting to reuse clash.
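A small illustration of that clash (all names here are made up for the example): an object that merely mimics the reader's methods would satisfy duck typing, but not an inheritance check.

import pyarrow as pa

class OdbcLikeReader:
    # Hypothetical reader that copies RecordBatchReader's method names
    # without inheriting from it.
    def __init__(self, schema, batches):
        self.schema = schema
        self._batches = iter(batches)

    def read_next_batch(self):
        return next(self._batches)

    def __iter__(self):
        return self

    def __next__(self):
        return self.read_next_batch()

schema = pa.schema([("a", pa.int64())])
batch = pa.record_batch([pa.array([1, 2, 3], type=pa.int64())], schema=schema)
reader = OdbcLikeReader(schema, [batch])

# Duck typing would accept this object; an isinstance check does not.
print(isinstance(reader, pa.RecordBatchReader))  # False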
There may be a solution space in which I still extend the Arrow C interface by directly accessing the handle pointer in its struct. Yet I would likely need to make assumptions about how PyArrow interacts with its C interface, e.g. in cases where the schema would change because we go to the next result set.
For now I find it safer to have an ODBC-specific interface and a PyArrow interface, with an explicit conversion function between them.
Best, Markus
This will allow arrow-odbc to be used with the pyarrow dataset API: https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
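For illustration, roughly what that looks like today via the explicit wrapping step; the query, connection string, and output path are hypothetical:

import pyarrow
import pyarrow.dataset as ds
from arrow_odbc import read_arrow_batches_from_odbc

query = read_arrow_batches_from_odbc(
    "SELECT * FROM my_table",
    connection_string="DSN=my_dsn",
)

# The wrapping step a natively returned RecordBatchReader would make unnecessary.
reader = pyarrow.RecordBatchReader.from_batches(query.schema, query)

# pyarrow.dataset.write_dataset accepts a RecordBatchReader and streams it to disk.
ds.write_dataset(reader, "my_dataset", format="parquet")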