pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

Return pyarrow.RecordBatchReader from read_arrow_batches_from_odbc #83

Closed rupurt closed 7 months ago

rupurt commented 10 months ago

This will allow arrow-odbc to be used with the pyarrow dataset API

https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html

pacman82 commented 9 months ago

Nice idea. Do you have something specific in mind that you want to do? I wonder whether, as a first step, it would be sufficient to add a few methods and inherit, or whether arrow-odbc would have to switch to the Arrow stream interface altogether.

pacman82 commented 9 months ago

Note to self: https://arrow.apache.org/docs/format/CStreamInterface.html

rupurt commented 9 months ago

I'm trying to plug it into the duckdb arrow reader.

pacman82 commented 9 months ago

How is arrow-odbc helpful at all if you already have a reader emitting batches?

rupurt commented 9 months ago

What I really want is for the reader to emit the Arrow batch interface so DuckDB can integrate with it directly. There are also a bunch of other Arrow APIs that only work with known Arrow types. I've currently wrapped them, but ideally I wouldn't need to.

pacman82 commented 9 months ago

The part that confuses me: following the link to the DuckDB Arrow reader interface, it seems to me that this is an API that emits record batches, but what I read in your description is that you want to consume record batches emitted by arrow-odbc. I would like to understand into which API call you want to pass the RecordBatchReader that read_arrow_batches_from_odbc would emit.

rupurt commented 9 months ago

This is the documentation for the Arrow batch reader. DuckDB can consume that natively.
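For illustration (a hypothetical snippet, not from arrow-odbc): given a pyarrow.RecordBatchReader in scope, DuckDB's Python replacement scans let a query reference it directly by variable name, which is the kind of native consumption I mean.

import duckdb
import pyarrow

# Build a tiny RecordBatchReader purely for illustration.
batch = pyarrow.RecordBatch.from_pydict({"x": [1, 2, 3]})
reader = pyarrow.RecordBatchReader.from_batches(batch.schema, [batch])

# DuckDB can query the reader directly by its variable name.
print(duckdb.sql("SELECT sum(x) FROM reader").fetchall())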

rupurt commented 9 months ago

cpp docs

timrburnham commented 9 months ago

I frequently use this pattern in my ETLs, using DuckDB for joining cross-database sources and transformation logic:

import arrow_odbc
import duckdb
import pyarrow

# sql is the query text, connect_string the ODBC connection string.
query = arrow_odbc.read_arrow_batches_from_odbc(sql, connect_string)
reader = pyarrow.RecordBatchReader.from_batches(query.schema, query)

with duckdb.connect("temp_db.duckdb") as duck:
    duck.from_arrow(reader).create("my_table")
    duck.sql("select * from my_table").to_parquet("output.parquet")

(The example uses DuckDB's lazy relational API.) An arrow-odbc function which returns a RecordBatchReader directly would be nice to have, rather than converting the BatchReader iterator with RecordBatchReader.from_batches as above.

rupurt commented 9 months ago

Thanks @timrburnham, that's essentially how I'm currently wrapping it. Not too much effort, but it would be nice to skip that step.

pacman82 commented 9 months ago

@timrburnham thanks for providing the example; it let me answer a number of questions I had.

My thoughts so far: I have been aware of the Arrow C stream interface. It is just a bit younger than this library. Migrating to it would not be too big a change code-wise; it is reasonably close to the C interface which is already in place. The part that would take effort is actually the build system, which I suspect would get more complicated. The benefit would be fewer indirections when calling a method, compared to the wrapping you demonstrated. I am not sure that would ever make a difference performance-wise, though.

Also you did not ask for performance improvements. You asked for code ergonomics.

I am skeptical about directly returning a RecordBatchReader, as it is neither an abstract base class nor an implicit or explicit protocol. Instead it seems more like a type-erasure type, and I am unsure how I would extend its functionality with things specific to this ODBC reader. E.g. more_results currently allows jumping to the next result set from the database, in case the statement or stored procedure returns more than one result set.

Currently I think one would still need a way for the user to explicitly opt out of all the ODBC-specific functions and transition to the pyarrow RecordBatchReader. Something like a to_pyarrow_batch_reader. This function could be implemented with what you do now: pyarrow.RecordBatchReader.from_batches(query.schema, query).
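As a rough sketch (the method does not exist yet; written here as a free function that only assumes the arrow-odbc reader exposes .schema and yields record batches when iterated):

import pyarrow

def to_pyarrow_batch_reader(odbc_reader):
    # Wrap the ODBC reader in a plain pyarrow.RecordBatchReader. The caller
    # gives up ODBC-specific methods such as more_results, but gains an object
    # that APIs expecting pyarrow's type will accept.
    return pyarrow.RecordBatchReader.from_batches(odbc_reader.schema, odbc_reader)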

Would you consider this an improvement?

Best, Markus

pacman82 commented 7 months ago

Closing this issue. Some final notes on the design here, so I remember them myself.

Inheriting directly from PyArrow's RecordBatchReader would imply using its C interface exclusively. That interface is basically a virtual function table plus a pointer to the object, and it is not extensible, so I cannot come up with C functions of my own which I want to support. This is fine to a degree, because I should not be able to extend the contract of a third-party library. It does, however, make it hard to extend for my specialization of the contract.

None of this would actually be a problem if DuckDB stuck to the generic roots of Python. Ironically, DuckDB's Python API does not use duck typing. If it did, it could all be solved by choosing the same method names in this library as in PyArrow. Instead, it seems to me, DuckDB explicitly checks for inheritance from that interface.

Here the two philosophies of inheriting to be reused and inheriting to reuse clash.
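To illustrate the clash with a schematic contrast (not DuckDB's actual code): a structural consumer would accept this library's reader, a nominal one would not, even though both readers behave the same.

import pyarrow

def consume_structural(reader):
    # Duck typing: anything exposing .schema and yielding record batches works,
    # including a reader that merely behaves like pyarrow.RecordBatchReader.
    schema = reader.schema
    for batch in reader:
        assert batch.schema == schema

def consume_nominal(reader):
    # Nominal typing: only instances of pyarrow.RecordBatchReader (or subclasses)
    # are accepted, which rules out a look-alike reader from another library.
    if not isinstance(reader, pyarrow.RecordBatchReader):
        raise TypeError("expected a pyarrow.RecordBatchReader")
    for batch in reader:
        assert batch.num_rows >= 0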

There may be a solution space in which I still extend the Arrow C interface by directly accessing the handle pointer in its struct. Yet I would likely need to make assumptions about how PyArrow interacts with its C interface, e.g. in cases where the schema would change because we go to the next result set.

For now I find it safer to have an ODBC-specific interface and a PyArrow interface, with an explicit conversion function between them.

Best, Markus