modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0
9.76k stars 651 forks source link

Incorrect syntax error on read_sql #979

Closed zacqed closed 2 years ago

zacqed commented 4 years ago

I am getting the following error in case of msssql.

Connection query details:

params1 = urllib.parse.quote_plus("driver={ODBC Driver 17 for SQL Server};server=server;database=Kepler;Uid=uid;Pwd=password;Encrypt=yes;")
engine2 = 'mssql+pyodbc:///?odbc_connect={}'.format(params1)

Error:

sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Incorrect syntax near ')'. (102) (SQLExecDirectW)")
[SQL: SELECT COUNT(*) FROM (select * from cashflow_data as foo)]

Minimal code is as follows:

import modin.experimental.pandas as pd
query = f'''select * from cashflow_data'''
file_data = pd.read_sql(sql = query, con=engine2,max_sessions = 10)
y = datetime.datetime.now() - x

Dataset example:

instu_id cashflow_amount currency_code cf_type data_date import_source entity maturity_date
6.03675E+11 7500 KWD P 30-Sep-19 MANUAL RFOL 1-May-20
6.03468E+11 635.143 KWD P 30-Sep-19 MANUAL RFOL 30-Sep-19
6.03474E+11 20000 KWD P 30-Sep-19 MANUAL RFOL 31-Oct-19
6.03E+11 5149 KWD P 30-Sep-19 MANUAL RFOL 18-Feb-20
6.03511E+11 16635 KWD P 30-Sep-19 MANUAL RFOL 30-Oct-19

@devin-petersohn : Any way to fix it

devin-petersohn commented 4 years ago

Hi @Harshitg, thanks for posting! This looks like something we're going to have to fix internally.

It will take me some time to set up a SQLServer environment to test this locally. I have a suspicion that the way we are aliasing our subqueries for postgres does not translate, so it's going to take some time to find a way to detect the type of database and fix the issue. I will make sure it is out for the next release.

eavidan commented 4 years ago

SQL SERVER requires aliasing the inner query. For instance, the following should work

SELECT COUNT(*) FROM (select * from cashflow_data) as foo

I believe this was the intent from the start, but the alias snuck into the inner query. or am I missing something bigger here?

devin-petersohn commented 4 years ago

@eavidan I believe you are correct. I wanted to set up a testing environment to verify this. Postgres also has this requirement, so we need to make sure that the fix doesn't break the postgres functionality.

eavidan commented 4 years ago

Thinking about it, Microsoft SQL does not support LIMIT and OFFSET as well. So this means that in order to support Microsoft SQL we would probably need to generalize the way we construct our queries better.

eavidan commented 4 years ago

@devin-petersohn After diving into SQL Alchemy and pandas implementation, I believe we would need to created different query templates for different SQL engines. we can add this to sql_reader.py. I am not sure there is a simple way to implement the distribution of the query in a unified way across all SQL engines. does this make sense? any other directions you had in mind?

devin-petersohn commented 4 years ago

Thanks @eavidan that makes sense, I think that's the best approach. As a part of this, we will need to add the different SQL engines to the CI. There's going to be a little setup required to do that.

devin-petersohn commented 4 years ago

Since this is a slightly larger rewrite, I'm pushing to the next release, which should be in 3-4 weeks, possibly sooner.

eavidan commented 4 years ago

@devin-petersohn sounds reasonable. I have tried working with SQL Alchemy to create a query generator that would wrap the given sql query, while being agnostic to the SQL engine. Unfortunately, I was not able to produce a complete solution. Going back to the last approach, do we want to simply check the connection string or engine dialect and choose a query template accordingly. something like this?

row_cnt_template = "SELECT COUNT(*) FROM ({}) as foo"
cols_names_template = "SELECT * FROM ({0}) as foo LIMIT 0"
partition_template = "SELECT * FROM ({0}) as foo LIMIT {1} OFFSET {2}"

engine = sa.create_engine(con)
if engine.driver == "pymssql":
    partition_template = "SELECT * FROM ({0}) as foo ORDER BY(SELECT NULL) OFFSET {2} ROWS FETCH NEXT {1} ROWS ONLY"
    cols_names_template = "SELECT TOP 0 FROM ({0}) as foo"

If so, would a generator for the templates be more suited and should we start only with a Base template and a MSSQL templates?

devin-petersohn commented 4 years ago

Yes, that should be good. Ideally this code block would go in a util of some sort outside of the implementation so it's easy to find/add new templates if necessary.

Ideally, as a part of this effort we can add better testing for the various SQL systems. I have set up a Windows test in the CI so we can add mssql to the testing environment there.

eavidan commented 4 years ago

I agree. started working on simple query generator as I believe it will make things more robust and clear instead of supplying templates. Still not sure about the. naming part or if utils is the right file for it. https://github.com/eavidan/modin/tree/issue_979/modin/engines/base/io/sql any thoughts? would you rather we open a WIP PR and conduct this discussion over there?

devin-petersohn commented 4 years ago

@eavidan Let's do a WIP PR and have the discussion there.

devin-petersohn commented 4 years ago

We have an open PR for this in progress, but the design is not finalized, pushing to next release.

SudharsanRama commented 4 years ago

I'm getting a similar error with Oracle.

pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT COUNT(*) FROM (select * from FROM fusion.fnd_flex_value_sets ffvs) as foo': ORA-00933: SQL command not properly ended

Here's my Modin setup:

from distributed import Client
client = Client(n_workers=6)
# Modin will connect to the Dask Client
import modin.pandas as pd

Apparently, oracle does not support table aliases with 'as' in front, hence the SQL command error.

devin-petersohn commented 4 years ago

Thanks @SudharsanRama, we are planning on getting this fix out for all databases in the next release.

As a temporary workaround, you can pass a SQLAlchemy connection to the pd.read_sql command. It will not run in parallel, but it should work.

eavidan commented 4 years ago

@SudharsanRama if I am not mistaken you have an error in your SQL syntax. It looks like you have a couple of FROM clauses.

select * from FROM fusion.fnd_flex_value_sets ffvs
SudharsanRama commented 4 years ago

@SudharsanRama if I am not mistaken you have an error in your SQL syntax. It looks like you have a couple of FROM clauses.

select * from FROM fusion.fnd_flex_value_sets ffvs

Yup. Sorry for the typo. I didn't want to put the whole query in there. But you get the idea.

thoughtnewbie commented 3 years ago

@eavidan I tried the approach you described, but I'm getting an exception. Below is my code snippet

` import modin.experimental.pandas as pd

SQL = "SELECT * FROM my_table WHERE my_date = '20200304'"

CONN = f'mssql+pyodbc://{getenv("UID")}:{getenv("PWD")}@{getenv("SERVER")}/DB_TEST?driver=ODBC Driver 17 for SQL Server'

pd.read_sql(sql=SQL, con=CONN) sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server] [SQL Server]Incorrect syntax near 'LIMIT'. (102) (SQLExecDirectW)") [SQL: SELECT FROM (SELECT FROM my_table WHERE my_date = '20200304') as foo LIMIT 0] (Background on this error at: http://sqlalche.me/e/13/f405) `

anil-kk commented 3 years ago

@eavidan I tried the approach you described, but I'm getting an exception. Below is my code snippet

` import modin.experimental.pandas as pd

SQL = "SELECT * FROM my_table WHERE my_date = '20200304'"

CONN = f'mssql+pyodbc://{getenv("UID")}:{getenv("PWD")}@{getenv("SERVER")}/DB_TEST?driver=ODBC Driver 17 for SQL Server'

pd.read_sql(sql=SQL, con=CONN) ` sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 17 for SQL Server] [SQL Server]Incorrect syntax near 'LIMIT'. (102) (SQLExecDirectW)") [SQL: SELECT * FROM (SELECT * FROM my_table WHERE my_date = '20200304') as foo LIMIT 0] (Background on this error at: http://sqlalche.me/e/13/f405)

I am getting the same error message.

devin-petersohn commented 3 years ago

I think it is possible to fix this in a database-agnostic way. Marking for the next release.

abjess commented 2 years ago

Hi, I am getting a similar error when trying to read from an oracle database.

mock code: import modin.pandas as pd query = """select distinct col_a, col_b FROM schema.table""" pd.read_sql(query, conn)

error: sqlalchemy.exc.DatabaseError: (cx_Oracle.DatabaseError) ORA-00933: SQL command not properly ended [SQL: SELECT COUNT(*) FROM (select distinct col_a, col_b FROM schema.table) AS _

mvashishtha commented 2 years ago

@abjess thank you for reporting your bug! Would you be able to open a new Modin issue describing your bug in some more detail? It would be great if you could provide as much information as you can there about your connection object or string. If you're using a sqlalchemy string, it's especially helpful to know the dialect and driver.

abjess commented 2 years ago

@mvashishtha yeah i will thanks

nadeem4 commented 1 year ago

Hi,

I am still getting similar issue with Azure Synapse Serverless Database. However I don't think Serverless SQL should create any difference because in the db_conn.py , dialect is being set to Microsoft SQL Server based on drivers (pymssql or Pyodbc). image

Library Versions:

Library: | Version -- | -- Python: | 3.7.14 Ray: | 2.0.0 Modin: | 0.12.1

Code:

import modin.pandas as pd import sqlalchemy as sa from modin.db_conn import ModinDatabaseConnection import urllib

params = urllib.parse.quote_plus('DRIVER='+driver+';SERVER=tcp:'+server+';PORT=1433;DATABASE='+database+';UID=' + username+';PWD=' + password + ';Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;MARS_Connection=Yes')

table_name = "dbo.test" query = f"SELECT * FROM {table_name}" sqlalchemy_connection_string = ( "mssql+pyodbc:///?odbc_connect=%s" % params )

modin_df = pd.read_sql( query, ModinDatabaseConnection("sqlalchemy", sqlalchemy_connection_string), ) Error:

image

mvashishtha commented 1 year ago

@nadeem4 thanks for your comment about SQL server support, which was a separate issue, #3524. Modin version 0.12.1 came out on December 18 2021, so it doesn't include commit ee2440c53a1e3bd47736776e7c643f05c4a0db70, the fix for #3524, because that commit was merged into Modin on March 10 2022.

Are you able to upgrade to the latest version of Modin, 0.15.3? Modin 0.15.3 requires python version at least 3.8.

matquant14 commented 1 year ago

Hi @devin-petersohn and @mvashishtha.

Has this been fully addressed? I think there's a flaw w/ your implementation for SQL Server when the SQL query includes an ORDER BY clause. T-SQL doesn't permit ORDER BY clauses in subqueries.

If I create a SQLAlchemy engine and pass it to the read_sql function, Modin reverts to pandas to extract the data from the database.

But if I pass a connection string to use the ModinDatabaseConnection class, then the row_count_query method breaks because of the ORDER BY clause contained w/in the original SQL query.

Now if I remove the ORDER BY clause in the SQL text and pass a connection string, then it seems that the read_sql function just hangs.

I don't experience this same behavior if I use something like polars' read_sql method. Polars uses connectorx as it's database engine, not SQLAlchemy. The downside is that connectorx doesn't support Snowflake

Thoughts?