modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.92k stars 653 forks source link

BUG: SQL Text Query #6220

Open mgroholski opened 1 year ago

mgroholski commented 1 year ago

Modin version checks

Reproducible Example

import modin.pandas as pd
CONNECTION_STRING = "..." #some mysql database
df = pd.read_sql("SELECT ebdb.likes.user_id, ebdb.likes.clothing_id, ebdb.likes.rating FROM ebdb.likes", CONNECTION_STRING)

Issue Description

The count query to divide the data into workers is not wrapped in sqlalchemy.text()

Expected Behavior

Return data frame of sql data.

Error Logs

```python-traceback Traceback (most recent call last): File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/starlette/routing.py", line 677, in lifespan async with self.lifespan_context(app) as maybe_state: File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/starlette/routing.py", line 566, in __aenter__ await self._router.startup() File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/starlette/routing.py", line 654, in startup await handler() File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/src/main.py", line 52, in startup await asyncio.gather(loadDatabase(), getRatings()) File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/src/main.py", line 124, in loadDatabase df = pd.read_sql("SELECT ebdb.likes.user_id, ebdb.likes.clothing_id, ebdb.likes.rating FROM ebdb.likes", CONNECTION_STRING) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log return obj(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/pandas/io.py", line 619, in read_sql return DataFrame(query_compiler=FactoryDispatcher.read_sql(**kwargs)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/core/execution/dispatching/factories/dispatcher.py", line 255, in read_sql return cls.get_factory()._read_sql(**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/core/execution/dispatching/factories/factories.py", line 333, in _read_sql return cls.io_cls.read_sql(**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log return obj(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/core/io/file_dispatcher.py", line 157, in read query_compiler = cls._read(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log return obj(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/modin/core/io/sql/sql_dispatcher.py", line 73, in _read row_cnt = pandas.read_sql(row_count_query, connection_for_pandas).squeeze() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/pandas/io/sql.py", line 590, in read_sql return pandas_sql.read_query( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/pandas/io/sql.py", line 1560, in read_query result = self.execute(*args) ^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/pandas/io/sql.py", line 1405, in execute return self.connectable.execution_options().execute(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mattgroholski/Documents/My Documents/BubbaTech.nosync/reccomendationSystem/.venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1411, in execute raise exc.ObjectNotExecutableError(statement) from err sqlalchemy.exc.ObjectNotExecutableError: Not an executable object: 'SELECT COUNT(*) FROM (SELECT ebdb.likes.user_id, ebdb.likes.clothing_id, ebdb.likes.rating FROM ebdb.likes) AS _MODIN_COUNT_QUERY' ```

Installed Versions

INSTALLED VERSIONS ------------------ commit : e8e57d96edf1c2ca874c685def859d0b1ca603da python : 3.11.3.final.0 python-bits : 64 OS : Darwin OS-release : 22.5.0 Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 Modin dependencies ------------------ modin : 0.21.0 ray : None dask : 2023.5.1 distributed : 2023.5.1 hdk : None pandas dependencies ------------------- pandas : 1.5.3 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 67.6.1 pip : 23.1.2 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : 2023.5.0 gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.10.1 snappy : None sqlalchemy : 2.0.15 tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None tzdata : None
mgroholski commented 1 year ago

Should be fixed by editing two files:

modin/core/storage_formats/pandas/parsers.py Line 884: df = pandas.read_sql(sqlalchemy.text(sql), con, index_col=index_col, **kwargs)

modin/core/io/sql/sql_dispatcher.py Line 72: row_count_query = sqlalchemy.text(con.row_count_query(sql)) Line 74: colum_names_query = sqlalchemy.text(con.column_names_query(sql))