microsoft / mssql-jdbc

The Microsoft JDBC Driver for SQL Server is a Type 4 JDBC driver that provides database connectivity with SQL Server through the standard JDBC application program interfaces (APIs).
MIT License

JDBC Support Using WITH CTE #1340

Closed ypv0rfh closed 4 years ago

ypv0rfh commented 4 years ago

Driver version

mssql-jdbc-8.2.2.jre8.jar

SQL Server version

SQL Server 2016

Client Operating System

Windows 10 64 bit

JAVA/JVM version

Java 8

Table schema

NA

Problem description

I am unable to submit a WITH CTE in a SQL Query via JDBC when using PySpark

JDBC trace logs

```
Py4JJavaError: An error occurred while calling o136.load. :
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'with'.
	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:600)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7225)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3053)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Unknown Source)
```

Reproduction code

Please see stack overflow post: https://stackoverflow.com/questions/61966144/apache-pyspark-jdbc-to-database-using-sql-cte-with-clause/61967266?noredirect=1#comment109639064_61967266

siggemannen commented 4 years ago

The problem is not with the driver, but with PySpark and CTEs. If you check the sources for this call:

com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)

you will see that resolveTable executes the following "template query": https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L113

"SELECT * FROM $table WHERE 1=0"

which will not work for your CTE, since a CTE cannot appear inside a FROM clause. You will probably need to either avoid CTEs or raise this issue with the PySpark project.
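The failure mode above can be sketched in a few lines of Python. `get_schema_query` here is an illustrative stand-in for Spark's Scala template, not Spark's actual API:

```python
# Sketch of the schema-resolution query Spark's JDBC layer builds
# (per the JdbcDialects.scala link above). Spark substitutes the
# user-supplied table/query value for $table; a table name or a
# parenthesised subquery is valid there, but a statement beginning
# with WITH is not, hence SQL Server's syntax error.

def get_schema_query(table: str) -> str:
    return f"SELECT * FROM {table} WHERE 1=0"

# Valid on SQL Server: a plain table reference.
print(get_schema_query("RDM_APD.TRAOUTP"))
# prints: SELECT * FROM RDM_APD.TRAOUTP WHERE 1=0
```

Substituting a query that starts with `with qry1 as (...)` into that template puts the WITH keyword directly after FROM, which SQL Server rejects.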

ulvii commented 4 years ago

Thanks for the explanation @siggemannen. @ypv0rfh do you need further assistance on this?

ypv0rfh commented 4 years ago

Thank you for your support.

I have checked with PySpark developers on Stack Overflow.

https://stackoverflow.com/questions/61966144/apache-pyspark-jdbc-to-database-using-sql-cte-with-clause?noredirect=1#comment109653036_61966144

They have reported the following:

"I see syntax error is coming from your Microsoft SQL server jdbc jar file.. seems like jar file is not supporting syntax or your syntax might be wrong.. spark will support CTE.. I have tested for Oracle, MySQL and postgres it is working."

What Java version / jdbc jar file supports using WITH CTE in SQL Server?

The SQL syntax works fine when using MS SQL Server Management Studio.

ulvii commented 4 years ago

Can you post the query that works in MS SQL Server Management Studio? I am also interested in the query PySpark generates; I can take a look and let you know whether it is valid.

ypv0rfh commented 4 years ago

I ran this query in SSMS and it works fine:

```sql
with qry1 as (
    SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP
)
SELECT CUR_WND_DT FROM qry1
```

Spark Query That Works

```python
query2 = '''
SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP
'''

spark_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://SERVERNAME;databaseName=DATABASE;integratedSecurity=true") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query2) \
    .load()
```

Spark Query WITH CTE That Doesn't Work

```python
query1 = '''
with qry1 as (
    SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP
)
SELECT CUR_WND_DT FROM qry1
'''

spark_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://SERVERNAME;databaseName=DATABASE;integratedSecurity=true") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query1) \
    .load()
```
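One workaround, assuming the CTE is non-recursive and referenced only once, is to inline it as a derived table, which SQL Server does accept inside a FROM clause. The rewrite below is equivalent to the failing query and can be passed to the same `.option("query", ...)` call:

```python
# Equivalent rewrite of the failing CTE as a derived table.
# Because no WITH keyword remains, Spark's schema-probe wrapper
# (SELECT * FROM (...) WHERE 1=0) produces valid T-SQL.
query1_rewritten = '''
SELECT CUR_WND_DT
FROM (SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP) AS qry1
'''
```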

ulvii commented 4 years ago

The query you posted is what you feed to PySpark, not what PySpark generates. As @siggemannen mentioned, PySpark executes SELECT * FROM $table WHERE 1=0 to get the schema. Use SQL Server Profiler to capture the exact query PySpark generates (look for SELECT * FROM $table WHERE 1=0). The error you posted is a syntax error and it comes from the server, meaning the query it received was invalid.

ulvii commented 4 years ago

Hi @ypv0rfh , Did you have a chance to collect the query generated by PySpark?

ulvii commented 4 years ago

Closing due to inactivity. Please feel free to contact us if you need further assistance.

sg11pankaj commented 4 years ago

Was anybody able to find a resolution for this? I'm also facing a similar issue where 'WITH CTE AS' queries are not working with Spark SQL.

ankitbko commented 4 years ago

@ulvii I do not understand what your solution is for this. Could you please explain it again possibly with a sample?

ulvii commented 4 years ago

@ankitbko, this does not seem like a driver issue; I suspect there is a problem with the query PySpark generates. Please read the comments above for an explanation. Try capturing the query with SQL Server Profiler and we can take a look at it.

ypv0rfh commented 4 years ago

@ulvii @ankitbko I opened this tracker and was unable to find a solution. The next step is to contact the PySpark project and ask. I really want to use PySpark with SQL Server, and WITH statements are very useful.

ankitbko commented 4 years ago

I have raised a bug on spark JIRA https://issues.apache.org/jira/browse/SPARK-32825

ypv0rfh commented 3 years ago

Did anyone get WITH / CTE SQL queries to work with PySpark and Microsoft SQL Server?

supunt commented 3 years ago

Did anyone get WITH / CTE SQL queries to work with PySpark and Microsoft SQL Server?

Nope, this is an annoying one. Because of it, we are having to run inner queries that are quite inefficient.

CTEs work fine on Oracle, but on SQL Server they are problematic.

Opened another ticket - https://issues.apache.org/jira/browse/SPARK-34928

Yasston commented 3 years ago

+1, CTEs simply don't work with this driver. I do believe it is a driver issue, since no query is actually sent from Spark to SQL Server (a Profiler trace shows no queries reaching the SQL Server instance). The error we see on the Spark side:

```
Py4JJavaError: An error occurred while calling o420.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
```

siggemannen commented 3 years ago

The reason it works in MySQL is that MySQL allows WITH to be part of a derived table, see https://dev.mysql.com/doc/refman/8.0/en/with.html:

```
SELECT ... WHERE id IN (WITH ... SELECT ...) ...
SELECT * FROM (WITH ... SELECT ...) AS dt ...
```

But this syntax is not valid in SQL Server, and no JDBC driver can override basic database functionality. The fix is relatively simple, though; I am not sure why PySpark is unwilling to implement it. In https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala one can override getSchemaQuery etc. to properly support CTEs as well as basic tables. It could be done like this: if the table parameter looks like a SELECT, just return it; otherwise use the original "template".
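A minimal sketch of that heuristic (in Python for illustration; the real fix would live in Spark's Scala MsSqlServerDialect, and the function name here mirrors the dialect method but is otherwise hypothetical):

```python
def get_schema_query(table: str) -> str:
    # If the user-supplied value already looks like a full statement
    # (SELECT or WITH), return it unchanged so the CTE never lands
    # inside a FROM clause; otherwise keep the original template.
    words = table.strip().split(None, 1)
    first_word = words[0].lower() if words else ""
    if first_word in ("select", "with"):
        return table
    return f"SELECT * FROM {table} WHERE 1=0"
```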

KevinAppelBofa commented 2 years ago

@Yasston, @siggemannen, @supunt, @ankitbko, @ypv0rfh

I opened a new Spark issue; hopefully there is some help to get this feature added. Otherwise you can patch the code yourself to get these queries to run. After applying the patches, both the JDBC CTE queries and the temp-table queries work: https://issues.apache.org/jira/browse/SPARK-37259

KevinAppelBofa commented 2 years ago

This was merged into the Spark 3.4.0-SNAPSHOT version, so it will be in Spark 3.4.0, which is due out next year. We are currently using it on Spark 3.2.1 by applying the changes that @peter-toth created in https://github.com/apache/spark/pull/36440 to the 3.2.1 sources.

Thanks everyone for helping to get this created and available

dan-marra-ebrs commented 2 months ago

This is one of the top Google results for this issue, so to help anyone else who lands here: see the Spark JDBC data source documentation for the prepareQuery option:

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
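For example (assuming a Spark version with the prepareQuery fix, 3.4.0 or later per the comments above): the CTE prefix goes in prepareQuery and the final SELECT in query, and Spark concatenates the two to form the statement it sends to the server, so the WITH clause never ends up inside a FROM clause.

```python
# CTE prefix in prepareQuery; final SELECT in query.
prepare_query = "WITH qry1 AS (SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP)"
query = "SELECT CUR_WND_DT FROM qry1"

# Spark forms the final statement by concatenating the two options:
full_statement = f"{prepare_query} {query}"

# Reader usage, unchanged apart from the extra option:
#   spark.read.format("jdbc") \
#       .option("url", "jdbc:sqlserver://SERVERNAME;databaseName=DATABASE;integratedSecurity=true") \
#       .option("prepareQuery", prepare_query) \
#       .option("query", query) \
#       .load()
```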