Closed Yasston closed 2 years ago
See also: https://issues.apache.org/jira/browse/SPARK-34928
Adding a concrete example:
A perfectly working example in SSMS:
The same query in a Synapse Spark pool using the SQL Server JDBC driver (cf. the issue I reported at https://github.com/microsoft/sql-spark-connector/issues/147):
The same query, but without the CTE, running fine in a Synapse Spark pool using the SQL Server JDBC driver:
As you can see, the issue is easy to reproduce.
Hope this helps. Thank you.
@Yasston As mentioned in https://github.com/microsoft/mssql-jdbc/issues/1340, the Spark JDBC driver and the Spark mssql connector run queries like "SELECT * FROM ${table} WHERE 1=0;" to get the schema, etc., during a DataFrame write to a SQL table. CTE syntax does not work with those queries in either Spark or the mssql connector. ${table} is the table name or the query in your script. Please also note that the mssql connector delegates reads to the Spark JDBC driver.
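To illustrate why this schema probe breaks CTE queries, here is a small sketch of the statement Spark's JDBC source builds before reading (the subquery alias `SPARK_GEN_SUBQ_0` is illustrative; the exact alias is an internal detail). The key point is that the user's query gets wrapped inside a derived table, and T-SQL does not allow a WITH clause there:

```python
# Sketch of the schema-probe statement Spark's JDBC source generates.
# The alias name below is illustrative, not Spark's exact internal name.

def schema_probe(query: str) -> str:
    """Wrap the user's 'query' option in a derived table, as Spark does,
    to fetch the result schema without fetching any rows (WHERE 1=0)."""
    return f"SELECT * FROM ({query}) SPARK_GEN_SUBQ_0 WHERE 1=0"

cte_query = ("WITH Cte_Name AS (SELECT Something FROM AnyTable) "
             "SELECT * FROM Cte_Name")

probe = schema_probe(cte_query)
# T-SQL forbids WITH inside a derived table, so SQL Server rejects this
# statement with "Incorrect syntax near the keyword 'WITH'", whereas
# PostgreSQL accepts a WITH clause inside a subquery.
print(probe)
```

This is also why the error surfaces at schema resolution time (`JDBCRDD.resolveTable` in the stack trace below) rather than during the actual read.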
@Yasston
I opened a new Spark issue; hopefully there is some help to get this feature added, otherwise you can patch the code yourself to get these queries to run. The SQL Spark connector piggybacks on the Spark JDBC data source, so since the JDBC source doesn't work, neither does the connector. After applying these patches, both the JDBC source and the SQL Spark connector work. https://issues.apache.org/jira/browse/SPARK-37259
I have been searching for a solution to the exact same problem for hours, and all responses lead to the same answer: a JDBC read is always going to wrap the query in a SELECT statement.
That's true if I am using the "dbtable" option; however, I am using the "query" option. This exact same syntax works perfectly for Postgres, but not for SQL Server.
Can the same steps be applied to both?
df = spark.read.format("jdbc")\
.option("driver", driver)\
.option("url", url)\
.option("query", table)\
.option("user", user)\
.option("password", password)\
.options(**ssl_args)\
.load()
Hey @randomsqldeveloper, I left this comment also in https://github.com/microsoft/mssql-jdbc/issues/1340. The Spark issue was closed and fixed by @peter-toth recently in the PR https://github.com/apache/spark/pull/36440, and the fix is currently in the Spark 3.4.0 snapshot. The PR includes an example of how to use it. Besides allowing CTE queries to work against SQL Server, we noticed this also makes Impala CTE queries work now; Impala has another unrelated issue, but once you fix that quote-character issue with Impala, it works too. There is a new parameter you can add that will run before the SELECT statement, which may help with your issue. This works in PySpark as well; the example in the PR is Scala code, but the same parameters also work in Python.
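For reference, a sketch of how the fix works, assuming the parameter mentioned above is the `prepareQuery` JDBC option added by SPARK-37259 / apache/spark#36440 (Spark 3.4+). Spark prefixes the `prepareQuery` text to every statement it generates, so the CTE ends up in front of the wrapping SELECT instead of inside the derived table, which is valid T-SQL:

```python
# Assumption: the new parameter is the "prepareQuery" JDBC option from
# SPARK-37259 / apache/spark#36440 (Spark 3.4+). The alias name below is
# illustrative of the probe Spark builds, not its exact internal name.

cte = "WITH Cte_Name AS (SELECT Something FROM AnyTable)"
body = "SELECT * FROM Cte_Name"

def probe_with_prepare(prepare_query: str, query: str) -> str:
    """Mimic the schema probe Spark builds when prepareQuery is set:
    the prefix is emitted before the generated wrapping SELECT."""
    return f"{prepare_query} SELECT * FROM ({query}) SPARK_GEN_SUBQ_0 WHERE 1=0"

# The CTE now precedes the outer SELECT, so SQL Server accepts it:
print(probe_with_prepare(cte, body))

# In PySpark, the query would be split across the two options, e.g.:
# df = (spark.read.format("jdbc")
#       .option("url", url)
#       .option("prepareQuery", cte)
#       .option("query", body)
#       .load())
```

The design point is that Spark cannot safely parse arbitrary vendor SQL to relocate a WITH clause itself, so it instead lets the user supply the prefix explicitly.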
Closing this issue as it is resolved by the fix in Spark.
Driver version
Driver version: latest (9.4.0). It is the driver natively included with the Synapse Spark pool.
SQL Server version
Microsoft SQL Azure (RTM) - 12.0.2000.8
Client Operating System
Azure Synapse Apache Spark pool (Spark v2 and v3 are impacted)
JAVA/JVM version
the JVM natively included with Azure Synapse Apache Spark pool (Spark v2 and v3 are impacted)
Problem description
Expected behaviour: queries with CTEs should execute normally and return the same results as when they are run from SSMS.
Actual behaviour: any SQL query containing one or more CTEs does not run. On Synapse Apache Spark we get the following error when trying to run a query such as "WITH Cte_Name AS (SELECT Something FROM AnyTable) SELECT * FROM Cte_Name":
Error message/stack trace:
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 210, in load
    return self._df(self._jreader.load())
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o217.load.
: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
	at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7375)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3206)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
	at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
	at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:446)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:61)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)