microsoft / sql-spark-connector

Apache Spark Connector for SQL Server and Azure SQL
Apache License 2.0

AWS GLUE save. com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$$anonfun$write$2 #187

Closed · girimedi closed this issue 2 years ago

girimedi commented 2 years ago

I previously got multiple errors. I then added `builder.config('spark.jars.packages', 'com.microsoft.azure:spark-mssql-connector:1.0.2')` to my Spark session builder and got the error below.

I have the connector JAR and mssql-jdbc-8.4.1.jre8.jar in my JARs path.
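For reference, the session setup I am describing looks roughly like this (a minimal sketch; the S3 paths and job wiring are placeholders, not my full script):

```python
# Rough sketch of the Glue PySpark setup described above; paths are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Pull the connector from Maven via the session builder, as described above.
spark = glueContext.spark_session.builder \
    .config("spark.jars.packages", "com.microsoft.azure:spark-mssql-connector:1.0.2") \
    .getOrCreate()
```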

Has anyone here been successful in getting this to work on AWS Glue?

```
py4j.protocol.Py4JJavaError: An error occurred while calling o137.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 29, 10.1.0.9, executor 1): java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$$anonfun$write$2
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1985)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1849)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2159)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
	at com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$.write(BestEffortSingleInstanceStrategy.scala:42)
	at com.microsoft.sqlserver.jdbc.spark.SingleInstanceConnector$.writeInParallel(SingleInstanceConnector.scala:35)
	at com.microsoft.sqlserver.jdbc.spark.Connector.write(Connector.scala:80)
	at com.microsoft.sqlserver.jdbc.spark.DefaultSource.createRelation(DefaultSource.scala:66)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.

2022-05-13 23:15:35,750 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:
Traceback (most recent call last):
  File "/tmp/MSSQL-jar versions.py", line 157, in <module>
    .option('driver','com.microsoft.sqlserver.jdbc.SQLServerDriver') \
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 732, in save
    self._jwrite.save()
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o137.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 29, 10.1.0.9, executor 1): java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.spark.SingleInstanceWriteStrategies$$anonfun$write$2
```

Was anyone here able to successfully use this connector in a Glue job?

arvindshmicrosoft commented 2 years ago

I have no experience with AWS Glue, but in principle, note that there are different versions of this connector to match different Spark runtime versions. I do not know which Spark version the Glue runtime provides, but do match it to the connector versions we publish on Maven:

https://github.com/microsoft/sql-spark-connector/blob/b26d95321600b469836b6204f455f7c5237d4382/README.md?plain=1#L14-L18
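As a rough sketch of what I mean (I have not tried this on Glue; the coordinates below are taken from the README table, so confirm them there before relying on this), version selection could look like:

```python
# Illustrative mapping from the Spark major.minor version the runtime provides
# to the connector artifact listed in the README; verify against the table above.
CONNECTOR_BY_SPARK = {
    "2.4": "com.microsoft.azure:spark-mssql-connector:1.0.2",       # Scala 2.11
    "3.0": "com.microsoft.azure:spark-mssql-connector_2.12:1.1.0",  # Scala 2.12
    "3.1": "com.microsoft.azure:spark-mssql-connector_2.12:1.2.0",  # Scala 2.12
}

def connector_for(spark_version: str) -> str:
    """Return the Maven coordinate matching a Spark version string like '3.1.1'."""
    major_minor = ".".join(spark_version.split(".")[:2])
    return CONNECTOR_BY_SPARK[major_minor]
```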

girimedi commented 2 years ago

I was finally able to get it to work. This is what made it work for me:

Under Dependent JARs path: `s3://{mys3bucketlocation}/mssql-jdbc-9.2.1.jre8.jar,s3://{mys3bucketlocation}/spark-mssql-connector_2.12-1.2.0.jar`

Job parameters: `--user-jars-first` set to `true`

Spark session:

```python
spark = glueContext.spark_session.builder \
    .config("spark.jars", "s3://{mys3bucketlocation}/mssql-jdbc-9.2.1.jre8.jar,s3://{mys3bucketlocation}/spark-mssql-connector_2.12-1.2.0.jar") \
    .config("spark.jars.packages", "com.microsoft.sqlserver.jdbc.SQLServerDriver,com.microsoft.azure:spark-mssql-connector_2.12:1.2.0") \
    .getOrCreate()
```
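With that session in place, the write itself goes through the connector's data source. A trimmed-down sketch of the save (server, database, and credential values are placeholders, not my actual job):

```python
# Trimmed-down sketch of the write, assuming `df` is the DataFrame to save
# and the connection details below are placeholders.
df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", "jdbc:sqlserver://{myserver}.database.windows.net:1433;databaseName={mydatabase}") \
    .option("dbtable", "dbo.{mytable}") \
    .option("user", "{username}") \
    .option("password", "{password}") \
    .save()
```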