pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

turbodbc is faster at downloading #72

Closed jonashaag closed 7 months ago

jonashaag commented 8 months ago

In my benchmarks, it seems like Turbodbc with use_async_io=True is 20–30% faster than arrow-odbc-py with fetch_concurrently().

I haven't done any profiling on this yet.
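
A comparison like the one described above could be set up roughly as follows. This is only a sketch, not the benchmark actually used in this issue: the query, connection string, batch size, and the helper names `best_time`, `percent_faster`, `fetch_with_arrow_odbc`, and `fetch_with_turbodbc` are all assumptions made for illustration. It also assumes "X% faster" means higher throughput (baseline time divided by candidate time), which the issue does not specify.

```python
import time
from typing import Callable


def best_time(fetch: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs of `fetch`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        timings.append(time.perf_counter() - start)
    return min(timings)


def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Throughput gain of the candidate over the baseline, in percent (assumed definition)."""
    return (baseline_s / candidate_s - 1.0) * 100.0


def fetch_with_arrow_odbc(query: str, connection_string: str, batch_size: int = 100_000) -> int:
    """Download the full result set with arrow-odbc-py and return the row count."""
    # Imported lazily so the timing helpers work without the library installed.
    from arrow_odbc import read_arrow_batches_from_odbc

    reader = read_arrow_batches_from_odbc(
        query=query, connection_string=connection_string, batch_size=batch_size
    )
    reader.fetch_concurrently()  # fetch the next batch while the current one is consumed
    return sum(batch.num_rows for batch in reader)


def fetch_with_turbodbc(query: str, connection_string: str) -> int:
    """Download the full result set with turbodbc and return the row count."""
    from turbodbc import connect, make_options

    options = make_options(use_async_io=True)
    connection = connect(connection_string=connection_string, turbodbc_options=options)
    cursor = connection.cursor()
    cursor.execute(query)
    return cursor.fetchallarrow().num_rows
```

With a real data source, one would wrap each fetch function in a `lambda` with the same query and connection string, pass both to `best_time`, and compare the results with `percent_faster`. Batch sizes and buffer settings differ between the two libraries, so "comparable settings" still needs to be chosen by hand.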

jonashaag commented 8 months ago

Btw, in https://github.com/pacman82/arrow-odbc-py/issues/47#issuecomment-1661655693 you (@pacman82) suggested using odbc2parquet to go from ODBC to Parquet without the Arrow intermediary. To me this implies that it's also faster than going through arrow-odbc-py. In my benchmarks with comparable settings, however, odbc2parquet is ~20% slower than going through arrow-odbc-py with fetch_concurrently() + pyarrow.parquet.
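
For context, the "arrow-odbc-py with fetch_concurrently() + pyarrow.parquet" route could look roughly like the sketch below. It is an illustration under assumptions, not the code that was benchmarked: the helper names `stream_batches` and `odbc_to_parquet`, the batch size, and the default Parquet writer settings are all hypothetical choices.

```python
from typing import Iterable


def stream_batches(batches: Iterable, writer) -> int:
    """Write each batch to `writer` as it arrives and return the total row count.

    `writer` only needs a `write_batch` method, and each batch a `num_rows`
    attribute, so the full result set is never held in memory at once.
    """
    rows = 0
    for batch in batches:
        writer.write_batch(batch)
        rows += batch.num_rows
    return rows


def odbc_to_parquet(query: str, connection_string: str, path: str,
                    batch_size: int = 100_000) -> int:
    """Stream a query result from ODBC into a Parquet file and return the row count."""
    # Imported lazily so `stream_batches` can be used without these libraries.
    import pyarrow.parquet as pq
    from arrow_odbc import read_arrow_batches_from_odbc

    reader = read_arrow_batches_from_odbc(
        query=query, connection_string=connection_string, batch_size=batch_size
    )
    reader.fetch_concurrently()  # overlap ODBC fetching with Parquet writing
    with pq.ParquetWriter(path, reader.schema) as writer:
        return stream_batches(reader, writer)
```

Since fetching and writing overlap here, the Parquet encoding cost can partly hide behind the network transfer, which may be one reason this route is competitive. That is a guess, not something measured in this issue.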

jonashaag commented 8 months ago

Update: Deleted invalid profiling results.

jonashaag commented 8 months ago

On another query arrow-odbc is much faster. Interesting.

pacman82 commented 7 months ago

Faster is always better, but I am closing this issue. Not sure what the definition of done here would be.