microsoft / sql-spark-connector

Apache Spark Connector for SQL Server and Azure SQL
Apache License 2.0

JDBC driver doesn't support CTEs #147

Closed Yasston closed 2 years ago

Yasston commented 3 years ago

Driver version

Driver version: latest (9.4.0). It is the driver natively included with the Synapse Spark pool

SQL Server version

Microsoft SQL Azure (RTM) - 12.0.2000.8

Client Operating System

Azure Synapse Apache Spark pool (Spark v2 and v3 are impacted)

JAVA/JVM version

the JVM natively included with Azure Synapse Apache Spark pool (Spark v2 and v3 are impacted)

Problem description

  1. Expected behaviour: queries with CTEs should execute normally and return the same results as when run from SSMS

  2. Actual behaviour: any SQL query containing one or more CTEs fails to run. On Synapse Apache Spark we get the following error when trying to run a query such as "WITH Cte_Name AS (SELECT Something FROM AnyTable) SELECT * FROM Cte_Name"

  3. Error message/stack trace:

    
    Py4JJavaError: An error occurred while calling o217.load.
    : com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
    at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7375)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3206)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:446)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
    return self._df(self._jreader.load())
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o217.load. : com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'. (Java stack trace identical to the one above.)



4. Any other details that can be helpful:
The query never even reaches the SQL Server instance: when trying to capture it with Profiler, we cannot see any query being executed.

A related GitHub issue on the "mssql-jdbc" project, with several commenters, was previously closed (wrongfully, in my view):
https://github.com/microsoft/mssql-jdbc/issues/1340
I opened a new issue there, since it seems connected: https://github.com/microsoft/mssql-jdbc/issues/1657

## Reproduction code
Try running any SQL query containing a CTE, for example the one shown above; a minimal sketch follows.
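A minimal PySpark sketch, assuming a generic Azure SQL connection (the URL, table, and credentials below are placeholders, not from the original report):

```python
# Placeholder connection details -- substitute your own.
url = "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb"
cte_query = "WITH Cte_Name AS (SELECT Something FROM AnyTable) SELECT * FROM Cte_Name"

# Fails with "Incorrect syntax near the keyword 'WITH'" because Spark
# wraps the query in a subquery when resolving the schema.
df = (spark.read
      .format("jdbc")
      .option("url", url)
      .option("query", cte_query)
      .option("user", "placeholder_user")
      .option("password", "placeholder_password")
      .load())
```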
Yasston commented 3 years ago

See also here : https://issues.apache.org/jira/browse/SPARK-34928

Yasston commented 3 years ago

Adding a concrete example:

Perfectly working example in SSMS: (screenshot)

Same query in the Synapse Spark pool using the SQL Server JDBC driver (cf. the issue I opened here: https://github.com/microsoft/sql-spark-connector/issues/147): (screenshot of the error)

Same query but without the CTE, running fine in the Synapse Spark pool using the SQL Server JDBC driver: (screenshot)
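For reference, a sketch of that CTE-free rewrite (same placeholder connection as in the reproduction code above; table and column names are illustrative):

```python
# Moving the CTE body into a derived table survives Spark's subquery
# wrapping, so the generated schema probe stays valid T-SQL.
no_cte_query = "SELECT * FROM (SELECT Something FROM AnyTable) AS Cte_Name"

df = (spark.read
      .format("jdbc")
      .option("url", url)  # placeholder, as above
      .option("query", no_cte_query)
      .load())
```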

As you can see, it's pretty easy to reproduce the issue.

Hope this helps. Thank you

luxu1-ms commented 2 years ago

@Yasston As mentioned in https://github.com/microsoft/mssql-jdbc/issues/1340, the Spark JDBC source and the Spark mssql connector run queries like "SELECT * FROM ${table} WHERE 1=0;" to resolve the schema, etc., when reading from or writing to a SQL table, where ${table} is the table name or the query in your script. CTE syntax does not work inside those generated queries, in both Spark JDBC and the mssql connector. Please also note that the mssql connector delegates reads to the Spark JDBC source.
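For illustration, roughly what that generated schema probe looks like when the "query" option holds a CTE (a sketch; the exact alias Spark generates may differ):

```python
user_query = "WITH Cte_Name AS (SELECT Something FROM AnyTable) SELECT * FROM Cte_Name"

# Spark treats the user query as a table expression and wraps it in a
# derived table to fetch the schema without returning any rows:
probe = f"SELECT * FROM ({user_query}) spark_gen_alias WHERE 1=0"

# T-SQL does not allow WITH inside a derived table, so SQL Server
# rejects it: "Incorrect syntax near the keyword 'WITH'".
```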

KevinAppelBofa commented 2 years ago

@Yasston

I opened a new Spark issue; hopefully that gets this feature added. Otherwise, you can patch the code yourself to get these queries to run. The SQL Spark connector piggybacks on Spark JDBC, so since JDBC isn't working, neither is this connector. After applying the patches, both the JDBC source and the SQL Spark connector work: https://issues.apache.org/jira/browse/SPARK-37259

RandomSqlDeveloper commented 2 years ago

I have been searching for a solution to the exact same problem for hours; all responses lead to: "JDBC read is always going to wrap the query in a SELECT statement."

That is true if I am using the option "dbtable". However, I am using the option "query", and this exact same syntax works perfectly for Postgres, but not for SQL Server.

Can the same steps be applied to both?

df = spark.read.format("jdbc")\
          .option("driver", driver)\
          .option("url", url)\
          .option("query", table)\
          .option("user", user)\
          .option("password", password)\
          .options(**ssl_args)\
          .load()
KevinAppelBofa commented 2 years ago

Hey @RandomSqlDeveloper, I left this comment also in https://github.com/microsoft/mssql-jdbc/issues/1340. The Spark issue was closed and fixed by @peter-toth recently in the PR https://github.com/apache/spark/pull/36440, which is currently in the Spark 3.4.0 snapshot. The PR includes an example of how to use it. Besides allowing SQL Server CTE queries to work, we noticed this also makes Impala CTE queries work; Impala has another, unrelated issue with its quote character, but once you fix that, Impala works too. There is a new parameter you can add that runs before the SELECT statement, which may help with your issue. This works in PySpark as well: the example in the PR is Scala code, but the same parameters also work in Python.
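A sketch of that new parameter in PySpark, based on the example in the PR (the option is named prepareQuery and requires Spark 3.4.0 or later; the connection details are placeholders):

```python
# The CTE goes into prepareQuery; Spark prepends it verbatim to the
# generated SELECT instead of wrapping the whole query in a subquery.
df = (spark.read
      .format("jdbc")
      .option("url", url)  # placeholder connection URL
      .option("prepareQuery", "WITH Cte_Name AS (SELECT Something FROM AnyTable)")
      .option("query", "SELECT * FROM Cte_Name")
      .load())
```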

luxu1-ms commented 2 years ago

Closing this issue, as it is resolved by the fix in Spark.