microsoft / sql-spark-connector

Apache Spark Connector for SQL Server and Azure SQL
Apache License 2.0

Support for Spark 3.4.0 #227

Closed · psolomin closed this 11 months ago

psolomin commented 1 year ago

I've tried this connector with Spark 3.4.0 on Java 17 and it seems to work for basic write and read operations. Is there a full test suite anywhere that must pass before Spark 3.4.0 support is added to the README?

P.S.: I needed to exclude the Scala dependency:

    implementation("com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA") {
        exclude group: 'org.scala-lang'
    }

Otherwise it conflicted with Spark's Scala 2.12.17.
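
For reference, a minimal PySpark write/read through this connector looks roughly like the sketch below. The format name is the connector's (it also appears later in this thread); the server URL, table name, credentials, and option names ("user"/"password", following the generic JDBC ones) are placeholders, not the values I actually used:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mssql-connector-smoke-test").getOrCreate()

# placeholder connection details - replace with your own server, database, and table
jdbc_url = "jdbc:sqlserver://myserver:1433;database=mydb;encrypt=true;trustServerCertificate=true;"
table = "dbo.SmokeTest"

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# basic write through the connector
(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", jdbc_url)
    .option("dbtable", table)
    .option("user", "myuser")          # placeholder credentials
    .option("password", "mypassword")
    .mode("append")
    .save())

# basic read back through the connector
read_back = (spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", jdbc_url)
    .option("dbtable", table)
    .option("user", "myuser")
    .option("password", "mypassword")
    .load())
read_back.show()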

hlngo commented 1 year ago

@psolomin Can you elaborate on how to exclude the Scala dependency? Which file should I add your snippet to? I am new to this and am trying to set up PySpark 3.4.0 on my local machine.

psolomin commented 1 year ago

Hello @hlngo

What file should I add your snippet to?

It's in my build.gradle - I use Gradle to create a set of jars for Spark to use. Here's a part of my build.gradle file:

configurations {
    userLibs.extendsFrom implementation
}

dependencies {
    // these are used for spark.driver.extraClassPath and spark.executor.extraClassPath
    userLibs("com.microsoft.azure:spark-mssql-connector_2.12:1.3.0-BETA") {
        exclude group: 'org.scala-lang'
    }
}

task copyUserDeps(type: Copy) {
    from configurations.userLibs
    into "${userDepsLocation}"
}

When I run ./gradlew copyUserDeps, it fetches everything declared under dependencies plus all transitive dependencies (except Scala) and puts the jars into userDepsLocation, which I configure.

Then, when I run my apps, I use:

spark-submit ... --conf spark.driver.extraClassPath="$userDepsLocation/*" --conf spark.executor.extraClassPath="$userDepsLocation/*"

Hope it helps.

zhukovgreen commented 1 year ago

I'm trying the same thing, but I run into problems on write:

java.lang.NoSuchMethodError: 'org.apache.spark.sql.types.StructType org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(java.sql.ResultSet, org.apache.spark.sql.jdbc.JdbcDialect, boolean)'

Tried on JDK 8 and 11.

psolomin commented 1 year ago

Hi @zhukovgreen

I did not see this when I tried inserting into a table of primitive types. What was the schema of the Spark DataFrame you attempted to insert? Did that insert into MS SQL work with previous versions of Spark?

zhukovgreen commented 1 year ago

Hi @psolomin. With the previous version of Spark it worked. I had the following write config:

options = {
    'truncate': 'true', 
    'batchsize': 1000000, 
    'tableLock': 'true', 
    'numPartitions': 25, 
    'use_bulk_copy': True, 
    'schemaCheckEnabled': 'true', 
    'driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver', 
    'username': 'uname', 
    'password': '****', 
    'url': 'jdbc:sqlserver://mymssqlserver:1433;database=DB;encrypt=true;trustServerCertificate=true;', 
    'dbtable': 'dbo.QueueItemTrainingRecommend', 
    'batchSize': 100000, 
    'checkConstraints': 'true', 
    'reliabilityLevel': 'BEST_EFFORT',
    }

I was writing using:

to_write.write.format("com.microsoft.sqlserver.jdbc.spark").options(**options).mode("overwrite").save()

The schema of the to_write df is:

root
 |-- Comp: string (nullable = true)
 |-- Empl: string (nullable = true)
 |-- UniqueId: string (nullable = true)
 |-- WorkData: string (nullable = true)
 |-- CreatedAt: timestamp (nullable = true)

Reading the exception, I see that it expects one boolean as the last argument, but in Spark 3.4 the signature has changed and it now takes two booleans.


psolomin commented 1 year ago

Thanks for sharing the details @zhukovgreen !

I think in all my use cases timestamps were passed to MS SQL as strings. If I bump into the same issue, I will consider forking and patching this.
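
For context, passing timestamps as strings can be as simple as formatting the column before the write; this is only a sketch of that idea (the example DataFrame is made up, reusing the CreatedAt column name from the schema above), not how the connector handles timestamps internally:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# tiny example DataFrame with a timestamp column named like the one in the schema above
df = spark.createDataFrame([("c1",)], ["Comp"]).withColumn("CreatedAt", F.current_timestamp())

# format the timestamp into a plain string before writing, so the value reaches
# MS SQL as a string and the server does the conversion on insert
df_as_strings = df.withColumn("CreatedAt", F.date_format("CreatedAt", "yyyy-MM-dd HH:mm:ss"))
df_as_strings.printSchema()  # CreatedAt is now a string column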

Vislesha commented 1 year ago

Any timeline for release of this connector for Spark 3.4 with Scala 2.13 and Java 17 support?

psolomin commented 1 year ago

@zhukovgreen just a follow-up:

I've tried inserting Spark timestamp types into a smalldatetime column, and it worked fine. I will run further checks there anyway, though.

Note: in my use cases, I don't let Spark create tables; they are created via plain SQL in a different tool, with indexes, etc. Spark only writes to them.
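
As a rough illustration of that workflow (table created up front via plain SQL, Spark only appending, including a timestamp column going into smalldatetime), the write looks something like the sketch below; the table name, connection details, and credentials are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# dbo.PreCreatedTable is assumed to exist already (created outside Spark,
# e.g. with an id int column and a CreatedAt smalldatetime column)
df = spark.createDataFrame([(1,)], ["id"]).withColumn("CreatedAt", F.current_timestamp())

(df.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", "jdbc:sqlserver://myserver:1433;database=mydb;")  # placeholder
    .option("dbtable", "dbo.PreCreatedTable")                        # placeholder, pre-created table
    .option("user", "myuser")
    .option("password", "mypassword")
    .mode("append")   # append only - Spark never creates or recreates the table
    .save())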

zhukovgreen commented 1 year ago

Thanks @psolomin, then consider my comment resolved. I will double-check my case once Spark 3.4 support is ready to use. Makes sense?