microsoft / sql-spark-connector

Apache Spark Connector for SQL Server and Azure SQL
Apache License 2.0

Assessing the risk of duplicated entries for BEST_EFFORT reliabilityLevel #252

Open jcblancomartinez opened 8 months ago

jcblancomartinez commented 8 months ago

Hi,

From here:

Implements the BEST_EFFORT write strategy for Single Instance. All executors insert into a user specified table directly. Write to table is not transactional and may results in duplicates in executor restart scenarios.

I'm not sure I understand how we could end up with duplicates when using the BEST_EFFORT reliabilityLevel.

For each partition of the Spark DataFrame we call savePartition, and savePartition only commits the transaction after SQLServerBulkCopy.writeToServer has succeeded.
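To make the question concrete, here is a toy Python model of that per-partition flow as I understand it. It is only a sketch and not the connector's code: `save_partition`, `run_task`, `ExecutorLost`, and the list standing in for the table are all hypothetical names. It assumes Spark retries a task whose executor is lost, and it contrasts a failure during the write (nothing committed) with a failure after the commit but before the task is reported as successful, which is the only case where I can see duplicates appearing.

```python
# Toy model only. All names here are hypothetical illustrations,
# not the sql-spark-connector implementation.

class ExecutorLost(Exception):
    """Simulates an executor dying before the driver sees the task succeed."""


def save_partition(table, rows, fail_during_write=False):
    """Stage all rows, then commit them to `table` in one step."""
    staged = list(rows)  # stands in for SQLServerBulkCopy.writeToServer
    if fail_during_write:
        # The transaction was never committed, so the staged rows are lost.
        raise ExecutorLost("died mid-write, transaction rolled back")
    table.extend(staged)  # stands in for the commit


def run_task(table, rows, failure=None, max_attempts=4):
    """Run one partition's task with retries; return the attempt that succeeded.

    `failure` is None, "during_write", or "after_commit", and applies to the
    first attempt only. The default of 4 attempts mirrors the default value
    of Spark's spark.task.maxFailures setting.
    """
    for attempt in range(1, max_attempts + 1):
        first = attempt == 1
        try:
            save_partition(
                table, rows,
                fail_during_write=(first and failure == "during_write"),
            )
            if first and failure == "after_commit":
                # Rows are already committed, but success is never reported.
                raise ExecutorLost("died after commit")
            return attempt
        except ExecutorLost:
            continue  # the scheduler re-runs the same partition
    raise RuntimeError("task failed on every attempt")


rolled_back = []
run_task(rolled_back, [1, 2, 3], failure="during_write")
print(rolled_back)  # [1, 2, 3]

duplicated = []
run_task(duplicated, [1, 2, 3], failure="after_commit")
print(duplicated)  # [1, 2, 3, 1, 2, 3]
```

If this model is roughly right, a failure before the commit is harmless, and the duplicate risk is limited to a task being re-run after its commit has already gone through.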

Thanks.

jcblancomartinez commented 8 months ago

@shivsood @luxu1-ms could you please help here?

Thanks.