[SUPPORT]Getting errors while using multi writers

numberlabs-developers commented 10 months ago

Describe the problem you faced

Hi, I am trying to use multiple writers to write data into different partitions with Hudi 0.14. I found this Medium article, https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6, which says multi-writing is possible when writer 1 uses the in-process lock provider and runs table services, while writer 2 only writes data with table services turned off. I tried the configs given there, and one of the writes always fails with the error below:

23/12/19 01:02:06 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 is aborting.
23/12/19 01:02:06 ERROR DataSourceInternalWriterHelper: Commit 20231219010014383 aborted
23/12/19 01:02:07 WARN BaseHoodieWriteClient: Cannot find instant 20231219010014383 in the timeline, for rollback
23/12/19 01:02:07 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 aborted.

Configs used:

load_df_1.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.query.type", "snapshot").
  option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL").
  option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.InProcessLockProvider").
  option("hoodie.metadata.enable", "false").
  option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
  mode("Overwrite").
  save("xxxxxxxxxxxx")

load_df_2.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.query.type", "snapshot").
  option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.metadata.enable", "false").
  option("hoodie.table.services.enabled", "false").
  option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
  mode("Overwrite").
  save("xxxxxxxxxxxx")
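To make the difference between the two writers easier to see, here is a minimal sketch that factors the options from the configs above into plain Scala maps and prints the keys that are unique to each writer. It is only a restatement of the posted configs, not a recommended setup. The object name `WriterOptions` and the value names are hypothetical, and the literal key `hoodie.table.name` is assumed to be the string behind `HoodieWriteConfig.TABLE_NAME`.

```scala
// Hypothetical helper: restates the two posted writer configs as maps.
object WriterOptions {
  // Options that appear in both writer configs.
  // "hoodie.table.name" is assumed to be the value of HoodieWriteConfig.TABLE_NAME.
  val sharedOptions: Map[String, String] = Map(
    "hoodie.datasource.write.recordkey.field" -> "xxxxxxxxxxxx",
    "hoodie.datasource.write.partitionpath.field" -> "xxxxxxxxxxxx",
    "hoodie.datasource.write.precombine.field" -> "xxxxxxxxxxxx",
    "hoodie.datasource.write.operation" -> "bulk_insert",
    "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
    "hoodie.datasource.query.type" -> "snapshot",
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "hoodie.datasource.write.hive_style_partitioning" -> "true",
    "hoodie.cleaner.policy.failed.writes" -> "LAZY",
    "hoodie.metadata.enable" -> "false",
    "hoodie.table.name" -> "xxxxxxxxxxxx"
  )

  // Writer 1: concurrency mode and in-process lock provider, table services left on.
  val writer1Options: Map[String, String] = sharedOptions ++ Map(
    "hoodie.write.concurrency.mode" -> "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider" ->
      "org.apache.hudi.client.transaction.lock.InProcessLockProvider"
  )

  // Writer 2: no lock settings, table services turned off.
  val writer2Options: Map[String, String] =
    sharedOptions + ("hoodie.table.services.enabled" -> "false")

  def main(args: Array[String]): Unit = {
    // Keys set by one writer but not the other.
    val onlyInWriter1 = writer1Options.keySet -- writer2Options.keySet
    val onlyInWriter2 = writer2Options.keySet -- writer1Options.keySet
    println("Only in writer 1: " + onlyInWriter1.toSeq.sorted.mkString(", "))
    println("Only in writer 2: " + onlyInWriter2.toSeq.sorted.mkString(", "))
  }
}
```

With Spark on the classpath, each map can be passed to the writer in one call, for example `load_df_1.write.format("org.apache.hudi").options(WriterOptions.writer1Options).mode("Overwrite").save("xxxxxxxxxxxx")`, which keeps the two jobs from drifting apart on the shared settings.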

Can someone help? Can this be done without an external lock provider, as the article describes, or do I have to use one of the recommended lock providers?

torvalds-dev-testbot[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev-testbot[bot] commented 10 months ago

I am unable to give a definitive answer on the compatibility of your system because you have not provided the specific versions of Spark, Hive, and Hadoop, or the type of storage. You also did not say whether you are running this on Docker. All of these variables can affect how Apache Hudi behaves.

According to the documentation you provided, Hudi 0.14.x requires specific versions of Spark (3.x), Hadoop (3.2.x), and Hive (3.x). Please provide the exact versions of these components in your environment so that compatibility can be checked conclusively.

Remember, without complete information, predicting the compatibility of the software with Hudi version 0.14.x might produce inaccurate results.

numberlabs-developers / hudi #65