[SUPPORT] Hi, I am trying a use case to use multi writer to write data into different partitions with version 0

torvalds-dev-testbot[bot] commented 10 months ago

Describe the problem you faced

Hi, I am trying a use case to use multi writer to write data into different partitions with version 0.14. I found this medium article https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6 which says I can do multi writing with writer 1 having in process lock which allows to perform services and writer 2 just writing the data with services turned off. I tried with configs given and one of the writes always fails with below error: 23/12/19 01:02:06 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 is aborting. 23/12/19 01:02:06 ERROR DataSourceInternalWriterHelper: Commit 20231219010014383 aborted 23/12/19 01:02:07 WARN BaseHoodieWriteClient: Cannot find instant 20231219010014383 in the timeline, for rollback 23/12/19 01:02:07 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 aborted.

Configs Used: load_df_1.write.format("org.apache.hudi"). option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx"). option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx"). option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx"). option("hoodie.datasource.write.operation", "bulk_insert"). option("hoodie.datasource.write.table.type", "COPY_ON_WRITE"). option("hoodie.datasource.query.type", "snapshot"). option("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). option("hoodie.datasource.write.hive_style_partitioning", "true"). option("hoodie.cleaner.policy.failed.writes","LAZY"). option("hoodie.write.concurrency.mode","OPTIMISTIC_CONCURRENCY_CONTROL"). option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.InProcessLockProvider"). option("hoodie.metadata.enable","false"). option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx"). mode("Overwrite"). save("xxxxxxxxxxxx")

load_df_2.write.format("org.apache.hudi"). option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx"). option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx"). option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx"). option("hoodie.datasource.write.operation", "bulk_insert"). option("hoodie.datasource.write.table.type", "COPY_ON_WRITE"). option("hoodie.datasource.query.type", "snapshot"). option("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). option("hoodie.datasource.write.hive_style_partitioning", "true"). option("hoodie.cleaner.policy.failed.writes","LAZY"). option("hoodie.metadata.enable","false"). option("hoodie.table.services.enabled","false"). option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx"). mode("Overwrite"). save("xxxxxxxxxxxx")

Can someone help? Can this be done without using locks as per article or should I definitely use any recommended lock provider? (edited)

torvalds-dev-testbot[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev-testbot[bot] commented 10 months ago

It seems like you are experiencing problems with multi-writing when using Apache Hudi. The error messages suggest that attempt to write the data resulted in a failure and the data source write support was aborted.

Plan:

To understand more about the issue, we should check the version of Apache Hudi that is currently being used. Apache Hudi has made several updates and improvements regarding multi-writes and concurrency control in their more recent versions. If you're using an older version, that might be the source of the problem.
It could also be that there's a problem with the configuration options you've provided. We can confirm the current configurations used to ensure that there's no misconfigured property.
We'll also check the conditions under which you're attempting the multi-writes to rule out any problems with the data or the environment. This includes the size of the datasets being written and system resources available at the time of the write operation.

Let's start with step 1, we should first verify the Apache Hudi version currently in use.

numberlabs-developers / hudi

[SUPPORT] Hi, I am trying a use case to use multi writer to write data into different partitions with version 0 #62