pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Use Turbodbc/Arrow for read_sql_table #17790

Status: Open

mrocklin commented 6 years ago

Currently I believe that the read_sql functions pull data from databases with sqlalchemy, which can be somewhat slow. The Turbodbc library seems to provide an alternative that is still ODBC-compatible and might operate at higher speed. We might consider adding an engine= keyword to the read_sql functions to let them use alternate libraries like this. In that case I would hope that we could pull from the database into Arrow memory and from there into pandas more efficiently.
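A hypothetical sketch of what such an engine= dispatch could look like (the keyword, the turbodbc code path, and all names here are illustrative assumptions, not existing pandas API):

```python
def read_sql(query, conn, engine="sqlalchemy"):
    """Illustrative dispatch only; pandas' read_sql has no engine keyword."""
    if engine == "turbodbc":
        # Assumed turbodbc path: fetch the result set as an Arrow table,
        # then let Arrow handle the conversion to a pandas DataFrame.
        cursor = conn.cursor()
        try:
            cursor.execute(query)
            return cursor.fetchallarrow().to_pandas()
        finally:
            cursor.close()
    elif engine == "sqlalchemy":
        raise NotImplementedError("the existing sqlalchemy-based path")
    else:
        raise ValueError(f"unknown engine: {engine!r}")
```

The turbodbc calls (cursor, execute, fetchallarrow) follow turbodbc's documented Arrow support.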

The documentation for turbodbc already shows how to do this using their API. There might be some value to integrating this into the Pandas API directly. From my perspective as a Dask developer I would like to use Turbodbc but would prefer that Pandas did the actual wrapping.

I spoke with @jreback about this in person. @MathMagique @xhochy @wesm may also be interested. My apologies if this has already been discussed elsewhere (I was surprised that I couldn't find anything).

http://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#apache-arrow-support
https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/

max-sixty commented 6 years ago

This would be great! It could potentially be a very small wrapper around a Turbodbc -> Arrow -> pandas path, if each of those steps already exists?

(That said, Turbodbc doesn't have the same level of support as SQLAlchemy, so we'd need to keep SQLAlchemy as an option.)

jorisvandenbossche commented 6 years ago

I actually just played with this last weekend, trying to get some idea of potential speed-up (as I am very enthusiastic about the potential!). A very basic implementation is just:

def read_sql_turbo(query, conn):
    # conn is an open turbodbc connection
    cursor = conn.cursor()
    cursor.execute(query)
    # fetch the full result set as an Arrow table, then convert to pandas
    res = cursor.fetchallarrow().to_pandas()
    cursor.close()
    return res

So it would certainly not be that hard to add something (more expanded) like that as an optional engine. The nice thing is that Arrow can take care of interpreting the data types (database types <-> pandas types), which is what we otherwise use sqlalchemy for.

The only downsides I noticed during my playing around (which are not necessarily reasons not to add this, as long as it is optional):

Side note: our current SQL I/O functions do indeed use sqlalchemy, and can indeed be quite slow, but as far as I know this is not due to sqlalchemy itself (we only use sqlalchemy core, not the ORM, which adds little overhead), but mainly due to slow database drivers (or configuration settings).

wesm commented 6 years ago

In case it's interesting, I would like to build a native sqlite3->arrow reader, which should yield a pretty major speedup for the sqlite use case. Same for libpq. ODBC turns out to be pretty heavy middleware -- vendors who have ODBC as the only way to push data in and out have optimized things a lot, whereas postgres/sqlite have not (downside of OSS at times, I guess)
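Not the native reader itself, but as a stdlib-only sketch of the columnar fetch such a reader would perform (the function name and output shape are illustrative):

```python
import sqlite3

def fetch_columnar(conn, query):
    # Execute the query and transpose the row-oriented result set into
    # one sequence per column -- the shape an Arrow table holds natively.
    cursor = conn.execute(query)
    names = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    columns = list(zip(*rows)) if rows else [() for _ in names]
    return dict(zip(names, columns))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
result = fetch_columnar(conn, "SELECT id, name FROM t")
# result == {"id": (1, 2), "name": ("a", "b")}
```

A native reader would do this inside the C layer, writing straight into Arrow buffers instead of Python tuples.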

MathMagique commented 6 years ago

I never would have dared to open an issue like this ;-). I agree that the feature needs to be optional, seeing that ODBC is a mouthful to set up initially and ODBC driver performance varies heavily from implementation to implementation. Enforcing turbodbc would be unacceptable.

jbrockmendel commented 1 year ago

@phofl is this one of the places where a pyarrow-related keyword is going to solve things?

phofl commented 1 year ago

I guess, but SQLAlchemy is slow in general when reading data, so it should be an improvement even when converting to NumPy dtypes afterwards. Maybe more like the engine keyword in read_csv and read_json?

WillAyd commented 8 months ago

If this issue is strictly about Arrow (de-)serialization then it should be covered by https://github.com/pandas-dev/pandas/issues/53869

Of course that PR is strictly about ADBC. If we still wanted Turbodbc, I think the same development patterns would apply.