microsoft / mssql-jdbc

The Microsoft JDBC Driver for SQL Server is a Type 4 JDBC driver that provides database connectivity with SQL Server through the standard JDBC application programming interfaces (APIs).

Azure Synapse connector post actions which internally uses mssql driver fail when using MERGE statement #1600

Closed: ChetanKardekar closed this issue 3 years ago

ChetanKardekar commented 3 years ago

Problem description

The Synapse connector uses the mssql driver to connect to Synapse and issue SQL commands. The Azure Synapse connector provides a feature called postActions: a post action is any SQL command that runs immediately after the write completes. The example below shows how to use postActions:

```python
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<>;database=<DATABASE>;user=<>;password=<>;encrypt=true;trustServerCertificate=false;loginTimeout=30") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "nameTab") \
    .mode("overwrite") \
    .option('maxStrLength', 4000) \
    .option('postActions', postActions) \
    .option("tempDir", "abfss://<>/tmp") \
    .save()
```

Post actions work for all the other SQL commands (SELECT, INSERT, UPDATE, etc.), but they do not work for a MERGE command like the one below:

postActions = "MERGE into nameTab_target AS target USING nameTab AS source ON target.ID = source.ID WHEN MATCHED THEN UPDATE SET target.name = source.name"

1. Expected behaviour: the MERGE command should work in a post action.

2. Actual behaviour: the MERGE command fails with the error below.

3. Error message/stack trace:

   ```
   Py4JJavaError: An error occurred while calling o1058.save. : com.databricks.spark.sqldw.SqlDWSideException: Azure Synapse Analytics failed to execute the JDBC query produced by the connector. Underlying SQLException(s):
   ```

4. Any other details that can be helpful: the Synapse connector uses the mssql driver to connect to Synapse and issue SQL commands.

Reproduction code

```scala
import java.sql.{Connection, DriverManager}

var conn: Connection = null
try {
  conn = DriverManager.getConnection("jdbc:sqlserver://...")
  // With the trailing ';' this executes; without it, the driver reports a parse error.
  val query = "MERGE into nameTab_target AS target USING nameTab AS source ON target.ID = source.ID WHEN MATCHED THEN UPDATE SET target.name = source.name;"
  val statement = conn.prepareStatement(query)
  try {
    statement.execute()
  } finally {
    statement.close()
  }
} finally {
  if (conn != null) {
    conn.close()
  }
}
```

We use the 8.2.2.jre8 SQL Server driver to connect to Synapse. The problem seems to be in the driver, which does not handle the MERGE command correctly without the terminating `;`.

You can use the following code to create the tables and insert data in SQL DW:

```sql
create table nameTab_target(ID int, name varchar(128));
insert into nameTab_target values (1, '?');
insert into nameTab_target values (2, '?');
insert into nameTab_target values (3, '?');
insert into nameTab_target values (4, '?');

create table nameTab(ID int, name varchar(128));
insert into nameTab values (1, 'A');
insert into nameTab values (2, 'B');
```

If you remove the `;` at the end of the MERGE statement, this code fails with `com.microsoft.sqlserver.jdbc.SQLServerException: Parse error at line: 1, column: 140: Incorrect syntax near 'name'.`

Note that other statements work correctly, e.g. both `INSERT INTO nameTab_target values (6, '?');` and `INSERT INTO nameTab_target values (6, '?')` work fine.

The reason this still fails through the connector is that the Synapse connector code splits postActions by `;`, so any semicolon at the end of a statement is removed before the command reaches the driver.
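For illustration, here is a minimal Python sketch of that assumed splitting behavior (the connector's actual implementation is not public, so the exact logic here is a guess):

```python
# Hypothetical illustration: splitting a ';'-separated postActions string
# strips the terminator from every command. The real connector logic is
# an assumption, not the actual Databricks source.
post_actions = (
    "MERGE into nameTab_target AS target USING nameTab AS source "
    "ON target.ID = source.ID "
    "WHEN MATCHED THEN UPDATE SET target.name = source.name;"
)

# After the split, the MERGE reaches the driver without its trailing ';'
# and fails to parse, while INSERT/UPDATE/SELECT are unaffected.
for sql in (s.strip() for s in post_actions.split(";")):
    if sql:
        print(sql)  # prints the MERGE with no terminating ';'
```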

lilgreenbird commented 3 years ago

hi @ChetanKardekar

Looking at the doc for postActions, it specifies that it is "a ; separated list of SQL commands...", so it seems that the `;` is required?

[Screenshot of the postActions documentation, 2021-06-16]

Have you tried this with another driver (e.g. .NET) to see if you get the same result?

lilgreenbird commented 3 years ago

Closing due to inactivity; please request to re-open the issue if necessary.

jelther commented 2 years ago

My team and I are having the exact same issue. The SQL statement that the JDBC connector executes works when run from SSMS. Maybe it's because the MERGE statement is still in preview for Azure Synapse?

cynthiajiangatl commented 2 years ago

I ran into the same issue: the MERGE statement works fine in Synapse Studio, but it does not work when the Databricks Spark connector writes to Synapse with it as a postAction:

```python
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", synapse_JDBCURL) \
    .option("user", dbutils.secrets.get(scope = secret_Scope, key = secret_SQLDBUser)) \
    .option("password", dbutils.secrets.get(scope = secret_Scope, key = secret_SQLDBPasswd)) \
    .option("tempDir", "abfss://" + file_System + "@" + storage_Account + ".dfs.core.windows.net/" + write_Staging_Folder) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "wwi_staging.SaleSmall") \
    .option("postActions", "MERGE wwi.SaleSmall AS tgt USING (SELECT try_cast(TransactionId as uniqueidentifier) as TransactionId,CustomerId,ProductId,sum(try_cast(Quantity as tinyint)) as Quantity,sum(Price) as Price, sum(TotalAmount) as TotalAmount, TransactionDate as TransactionDateId, sum(ProfitAmount) as ProfitAmount, Hour, Minute,StoreId FROM wwi_staging.SaleSmall group by TransactionDate, StoreId, CustomerId, ProductId, TransactionId, Hour, Minute ) AS src ON (tgt.TransactionId = src.TransactionId) AND (tgt.CustomerId = src.CustomerId) AND (tgt.ProductId = src.ProductId) AND (tgt.TransactionDateId = src.TransactionDateId) AND (tgt.StoreId = src.StoreId) AND (tgt.Hour = src.Hour) AND (tgt.Minute = src.Minute) WHEN MATCHED AND tgt.Quantity <> src.Quantity THEN UPDATE SET tgt.Quantity = src.Quantity, tgt.ProfitAmount = src.ProfitAmount, tgt.TotalAmount = src.TotalAmount; ") \
    .mode("overwrite").save()
```

```
Py4JJavaError                             Traceback (most recent call last)
in
      1 # Test 2: Write some data into a staging table, then merge from staging table to target table
----> 2 df.write \
      3     .format("com.databricks.spark.sqldw") \
      4     .option("url", synapse_JDBCURL) \
      5     .option("user", dbutils.secrets.get(scope = secret_Scope, key = secret_SQLDBUser)) \

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    736             self.format(format)
    737         if path is None:
--> 738             self._jwrite.save()
    739         else:
    740             self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o2573.save.
: com.databricks.spark.sqldw.SqlDWSideException: Azure Synapse Analytics failed to execute the JDBC query produced by the connector. Underlying SQLException(s):
  - com.microsoft.sqlserver.jdbc.SQLServerException: Parse error at line: 1, column: 859: Incorrect syntax near 'TotalAmount'. [ErrorCode = 103010] [SQLState = S0001]
    at com.databricks.spark.sqldw.Utils$.wrapExceptions(Utils.scala:723)
    at com.databricks.spark.sqldw.DefaultSource.createRelation(DefaultSource.scala:89)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:239)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:386)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:186)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:968)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:141)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:336)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:160)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:156)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:575)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:167)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:575)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:551)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:156)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:156)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:141)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:132)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:186)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:427)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: Exception thrown in awaitResult:
    at com.databricks.spark.sqldw.JDBCWrapper.executeInterruptibly(SqlDWJDBCWrapper.scala:137)
    at com.databricks.spark.sqldw.JDBCWrapper.$anonfun$executeInterruptibly$1(SqlDWJDBCWrapper.scala:115)
    at com.databricks.spark.sqldw.JDBCWrapper.$anonfun$executeInterruptibly$1$adapted(SqlDWJDBCWrapper.scala:115)
    at com.databricks.spark.sqldw.JDBCWrapper.withPreparedStatement(SqlDWJDBCWrapper.scala:357)
    at com.databricks.spark.sqldw.JDBCWrapper.executeInterruptibly(SqlDWJDBCWrapper.scala:115)
    at com.databricks.spark.sqldw.SqlDwWriter.$anonfun$saveToSqlDW$12(SqlDwWriter.scala:297)
    at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
    at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:377)
    at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:363)
    at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
    at com.databricks.spark.sqldw.SqlDwWriter.$anonfun$saveToSqlDW$11(SqlDwWriter.scala:297)
    at com.databricks.spark.sqldw.SqlDwWriter.$anonfun$saveToSqlDW$11$adapted(SqlDwWriter.scala:290)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at com.databricks.spark.sqldw.SqlDwWriter.$anonfun$saveToSqlDW$1(SqlDwWriter.scala:290)
    at com.databricks.spark.sqldw.SqlDwWriter.$anonfun$saveToSqlDW$1$adapted(SqlDwWriter.scala:73)
    at com.databricks.spark.sqldw.JDBCWrapper.withConnection(SqlDWJDBCWrapper.scala:335)
    at com.databricks.spark.sqldw.SqlDwWriter.saveToSqlDW(SqlDwWriter.scala:73)
    at com.databricks.spark.sqldw.DefaultSource.$anonfun$createRelation$3(DefaultSource.scala:122)
    at com.databricks.spark.sqldw.Utils$.wrapExceptions(Utils.scala:692)
    ... 44 more
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Parse error at line: 1, column: 859: Incorrect syntax near 'TotalAmount'.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
    at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3272)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:247)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:222)
    at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.execute(SQLServerPreparedStatement.java:505)
    at com.databricks.spark.sqldw.JDBCWrapper.$anonfun$executeInterruptibly$2(SqlDWJDBCWrapper.scala:115)
    at com.databricks.spark.sqldw.JDBCWrapper.$anonfun$executeInterruptibly$2$adapted(SqlDWJDBCWrapper.scala:115)
    at com.databricks.spark.sqldw.JDBCWrapper.$anonfun$executeInterruptibly$3(SqlDWJDBCWrapper.scala:129)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
```
jelther commented 2 years ago

My workaround was to create a stored procedure that executes the MERGE statement. Hacky, but it worked.
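For later readers, a minimal sketch of that workaround, assuming the nameTab tables from the reproduction above; the procedure name and the Spark variables (`df`, `synapse_JDBCURL`, `temp_dir`) are hypothetical:

```python
# One-time setup in Synapse (e.g. from Synapse Studio); the procedure name
# dbo.usp_merge_nameTab is hypothetical:
#
#   CREATE PROCEDURE dbo.usp_merge_nameTab AS
#   BEGIN
#       MERGE INTO nameTab_target AS target
#       USING nameTab AS source
#       ON target.ID = source.ID
#       WHEN MATCHED THEN UPDATE SET target.name = source.name;
#   END
#
# postActions then only contains an EXEC call, so the connector's ';'
# splitting never touches the MERGE text itself.
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", synapse_JDBCURL) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "nameTab") \
    .option("tempDir", temp_dir) \
    .option("postActions", "EXEC dbo.usp_merge_nameTab") \
    .mode("overwrite") \
    .save()
```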

cynthiajiangatl commented 2 years ago

Thank you @jelther, stored procedure worked!