modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.59k stars 649 forks source link

Feature Request: Support read_sql_query #3233

Open mflaxman10 opened 2 years ago

mflaxman10 commented 2 years ago

This is a feature request and not a bug, so would apply to any version.

It is to support the "read_sql_query" method which has been in pandas since near the beginning (0.17 at least). Optional, but very useful in this case would be to support the "chunksize" parameter in a fashion appropriate to Modin.

The use case is simple and common: you've got data you'd like to access in some database, like postGIS or OmniSciDB, which supports SQLAlchemy connections. Typically its a subset of something huge which you'd like for some ML. In the OmniSci case, the data are natively on GPU, so you want to do a fast subset there to minimize the data in motion (such as a 'push down' filter on LIDAR points within a bbox or geographic shape)

This can only be done today with a hack such as that below from https://blog.panoply.io/how-to-read-a-sql-query-into-a-pandas-dataframe

def read_sql_tmpfile(query, db_engine): with tempfile.TemporaryFile() as tmpfile: copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format( query=query, head="HEADER" ) conn = db_engine.raw_connection() cur = conn.cursor() cur.copy_expert(copy_sql, tmpfile) tmpfile.seek(0) df = pandas.read_csv(tmpfile) return df

devin-petersohn commented 2 years ago

Query pushdown is something we are interested in pursuing. It is a large undertaking to both queue up operations and pushdown their SQL equivalents, but I think it can be done.