snowflakedb / snowflake-jdbc

Snowflake JDBC Driver
Apache License 2.0

SNOW-264052: Data inconsistency when writing to Snowflake using multithreading #433

Closed vaibhavsingh007 closed 3 years ago

vaibhavsingh007 commented 3 years ago

Hi, I heard that concurrent writes to Snowflake over the same JDBC connection are thread-safe, as they ought to be. However, I encountered the following issue:

"Hi, I just ran into an issue where k records from across n Spark DataFrames are written as k+y records, where y is non-deterministic, when writing concurrently over the same SF JDBC connection (drivers: spark-snowflake_2.11-2.5.2-spark_2.4.jar, snowflake-jdbc-3.9.1.jar).

write config:

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

# Set options below
sfOptions = {
    "sfURL" : "****",
    "sfAccount" : "****", # Also needed the CREATE STAGE privilege.
    "sfUser" : username,
    "sfPassword" : password,
    "sfDatabase" : database,
    "sfSchema" : schema,
    "sfWarehouse" : warehouse,
    "sfRole" : role
}

# Append the DataFrame to the target table
sparkDF.write \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("dbtable", f"MY_WORKSPACE.{table}") \
    .mode('append') \
    .save()

How can I synchronize this to achieve data integrity?"

ref: https://github.com/snowflakedb/snowflake-jdbc/issues/3#issuecomment-760333136
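For readers asking the same synchronization question, one conservative option is to serialize the writes behind a shared lock. This is only a minimal sketch, assuming the concurrent writes are launched from Python threads in the Spark driver program; the thread does not confirm that this avoids the extra rows. The names `write_serialized`, `write_all`, `_write_lock`, and `write_fn` are hypothetical, and `write_fn` stands in for a function that wraps the `sparkDF.write ... .save()` call shown above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: a single lock shared by every writer thread.
_write_lock = threading.Lock()


def write_serialized(df, table, write_fn):
    """Call write_fn(df, table) while holding the shared lock."""
    with _write_lock:
        write_fn(df, table)


def write_all(frames_by_table, write_fn, max_workers=4):
    """Submit one write per (table, df) pair; the lock keeps writes from overlapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_serialized, df, table, write_fn)
            for table, df in frames_by_table.items()
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread
```

Serializing the writes gives up the parallelism of the original approach, so it is better suited to isolating the problem than to serving as a long-term fix.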

vaibhavsingh007 commented 3 years ago

Issue resolved by upgrading to the latest drivers (spark-snowflake_2.11-2.8.3-spark_2.4.jar, snowflake-jdbc-3.12.17.jar). The data is now consistent.
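A consistency check of this kind can be sketched as a row-count comparison: sum the source DataFrame counts (k), compare that with the number of rows the table actually gained, and report the difference (y). The helper name `surplus_rows` is hypothetical, and the check assumes no other process writes to the table while it runs.

```python
def surplus_rows(source_counts, rows_before, rows_after):
    """Return y: rows actually added to the table minus k (sum of source counts).

    0 means the write was consistent; a positive value means extra rows.
    """
    expected = sum(source_counts)         # k
    written = rows_after - rows_before    # rows the table gained
    return written - expected             # y


# Hypothetical usage with PySpark (not executed here):
#   source_counts = [df.count() for df in dataframes]
#   rows_after = (spark.read.format(SNOWFLAKE_SOURCE_NAME)
#                 .options(**sfOptions)
#                 .option("dbtable", f"MY_WORKSPACE.{table}")
#                 .load().count())
```

A row count only detects surplus or missing rows; it does not show which records were duplicated.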