Closed — ypv0rfh closed this issue 4 years ago
The problem is not with the driver, but with PySpark and CTEs. If you check the sources, for the call here:

```
com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
```

you will see that resolveTable issues the following "template query": https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/src/main/scala/sql/jdbc (see JdbcDialects.scala#L113 at https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L113)

`SELECT * FROM $table WHERE 1=0`

which will not work for your CTE, since SQL Server does not allow a CTE inside a FROM clause. You will probably need to either avoid CTEs or raise this issue with the PySpark developers.
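To make the failure concrete, here is a small Python sketch of what happens (the function name is hypothetical; the template string is the one from the Spark source linked above, and the subquery alias is illustrative):

```python
def get_schema_query(table: str) -> str:
    # Default schema-probe template from Spark's JdbcDialects:
    # "SELECT * FROM $table WHERE 1=0"
    return f"SELECT * FROM {table} WHERE 1=0"

# A plain table name yields valid T-SQL:
plain = get_schema_query("RDM_APD.TRAOUTP")
# SELECT * FROM RDM_APD.TRAOUTP WHERE 1=0

# With the "query" option, Spark wraps the user query as a derived table,
# so a CTE lands inside the FROM clause, which SQL Server rejects with
# "Incorrect syntax near the keyword 'WITH'":
cte = ("WITH qry1 AS (SELECT CUR_WND_DT FROM RDM_APD.TRAOUTP) "
       "SELECT CUR_WND_DT FROM qry1")
broken = get_schema_query(f"({cte}) SPARK_GEN_SUBQ")
print(broken)
```

The second string is syntactically fine for MySQL or PostgreSQL, which is why the same code path works against those databases but not against SQL Server.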
Thanks for the explanation @siggemannen. @ypv0rfh do you need further assistance on this?
Thank you for your support.
I have checked with PySpark developers on Stack Overflow.
They have reported the following:
"I see syntax error is coming from your Microsoft SQL server jdbc jar file.. seems like jar file is not supporting syntax or your syntax might be wrong.. spark will support CTE.. I have tested for Oracle, MySQL and postgres it is working."
What Java version / jdbc jar file supports using WITH CTE in SQL Server?
The SQL syntax works fine when using MS SQL Server Management Studio.
Can you post the query that works on MS SQL Server Management Studio? I am interested to see the query PySpark generates, can take a look at it and let you know if it is a valid query.
I posted this query in SSMS and it works fine:

```sql
with qry1 as (
SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP
)
SELECT CUR_WND_DT FROM qry1
```
```python
query2 = '''
SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP
'''

spark_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://SERVERNAME;databaseName=DATABASE;integratedSecurity=true") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query2) \
    .load()

query1 = '''
with qry1 as (SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP)
SELECT CUR_WND_DT FROM qry1
'''

spark_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://SERVERNAME;databaseName=DATABASE;integratedSecurity=true") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("query", query1) \
    .load()
```
The query you posted is what you feed to PySpark, not what PySpark generates. As @siggemannen mentioned, PySpark executes `SELECT * FROM $table WHERE 1=0` to get the schema. Use SQL Server Profiler to capture the exact query PySpark generates (look for `SELECT * FROM $table WHERE 1=0`).
The error you posted is a syntax error, and it comes from the server, meaning the query it received was invalid.
Hi @ypv0rfh , Did you have a chance to collect the query generated by PySpark?
Closing due to inactivity. Please feel free to contact us if you need further assistance.
Was anybody able to find a resolution for this? I'm also facing a similar issue where queries with a `WITH ... AS` (CTE) clause are not working through Spark SQL.
@ulvii I do not understand what your solution is for this. Could you please explain it again possibly with a sample?
@ankitbko , This does not seem like a driver issue, I suspect there is an issue with the query the PySpark generates, please read the comments above for explanation. Try gathering the query using SQL Profiler, we can take a look at it.
@ulvii @ankitbko I opened this tracker and was unable to find a solution. The next step is to contact PySpark and ask. I really want to use PySpark with SQL Server, and WITH statements are very useful.
I have raised a bug on spark JIRA https://issues.apache.org/jira/browse/SPARK-32825
Did anyone get WITH / CTE SQL queries to work with PySpark and Microsoft SQL Server?
> Did anyone get WITH / CTE SQL queries to work with PySpark and Microsoft SQL Server?
Nope, this is an annoying one. Because of this we are having to run inner queries that are quite inefficient.
CTEs work fine on Oracle, but on SQL Server they are problematic.
Opened another ticket - https://issues.apache.org/jira/browse/SPARK-34928
+1, CTEs simply don't work with this driver. I do believe it is a driver issue, since no query is actually sent from Spark to SQL Server (a profiler trace shows no queries being executed on the SQL Server instance). The errors we can see on the Spark side:

```
Py4JJavaError: An error occurred while calling o420.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
```
The reason it works in MySQL is that MySQL allows WITH to be part of a derived table, see https://dev.mysql.com/doc/refman/8.0/en/with.html:

```sql
SELECT ... WHERE id IN (WITH ... SELECT ...) ...
SELECT * FROM (WITH ... SELECT ...) AS dt ...
```

But this syntax is not valid in SQL Server, and no JDBC driver can override basic database functionality. The fix is "relatively" simple though; I am not sure why the PySpark maintainers are not willing to implement it. In https://github.com/apache/spark/blob/695cb617d42507eded9c7e50bc7cd5333bbe6f83/sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala one can override getSchemaQuery etc. to properly support CTEs as well as plain tables. It could be done like this: if the table parameter looks like a SELECT, just return it as-is; otherwise apply the original template.
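As a rough illustration of that suggestion (a Python sketch of the idea only; the real change would live in the Scala dialect code, and the function name here is hypothetical):

```python
import re

def get_schema_query_patched(table: str) -> str:
    # Sketch of the proposed fix: if the "table" expression is really a
    # complete statement (a CTE or a bare SELECT), pass it through
    # unchanged; otherwise fall back to the default probing template.
    if re.match(r"\s*\(?\s*(WITH|SELECT)\b", table, re.IGNORECASE):
        return table
    return f"SELECT * FROM {table} WHERE 1=0"
```

A production version would also need to restrict the passed-through statement to zero rows (the probe only needs the schema), which is essentially what Spark's later `prepareQuery` support does.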
@Yasston, @siggemannen, @supunt, @ankitbko, @ypv0rfh
I opened a new Spark issue in the hope that this feature gets added; otherwise you can patch the code yourself to get these queries to run. After applying these patches, both the CTE queries and the temp-table queries work over JDBC: https://issues.apache.org/jira/browse/SPARK-37259
This was merged into the Spark 3.4.0-SNAPSHOT version, so it will be in Spark 3.4.0, which is supposed to come out next year. We are currently using this in Spark 3.2.1 by applying the changes @peter-toth created in https://github.com/apache/spark/pull/36440 to our Spark 3.2.1 build.
Thanks everyone for helping to get this created and available
This is one of the top Google results for this issue, so to help anyone else who arrives here: see the documentation for the `prepareQuery` option at https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
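Based on that documentation, `prepareQuery` lets you supply the CTE prologue separately so Spark no longer nests it inside a FROM clause. A minimal sketch reusing the query from this thread (server and database names are placeholders, and this requires a Spark version with `prepareQuery` support, i.e. 3.4.0 or later):

```python
# Placeholder connection details; prepareQuery is honored by the
# MSSQL dialect starting with Spark 3.4.0.
options = {
    "url": "jdbc:sqlserver://SERVERNAME;databaseName=DATABASE;integratedSecurity=true",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    # The WITH clause goes in prepareQuery; Spark prepends it to the
    # generated statements instead of wrapping it in a derived table.
    "prepareQuery": "WITH qry1 AS (SELECT TRAOUTP.CUR_WND_DT FROM RDM_APD.TRAOUTP)",
    "query": "SELECT CUR_WND_DT FROM qry1",
}

# With a live SparkSession and SQL Server instance this would be:
# spark_df = spark.read.format("jdbc").options(**options).load()
```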
Driver version
mssql-jdbc-8.2.2.jre8.jar
SQL Server version
SQL Server 2016
Client Operating System
Windows 10 64 bit
JAVA/JVM version
Java 8
Table schema
NA
Problem description
I am unable to submit a SQL query containing a WITH CTE via JDBC when using PySpark.
JDBC trace logs
```
Py4JJavaError: An error occurred while calling o136.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'with'.
	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:600)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:522)
	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7225)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3053)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:444)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Unknown Source)
```
Reproduction code
Please see stack overflow post: https://stackoverflow.com/questions/61966144/apache-pyspark-jdbc-to-database-using-sql-cte-with-clause/61967266?noredirect=1#comment109639064_61967266