numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hi, I am trying a use case to use multi writer to write data into different partitions with version 0 #60

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago

Describe the problem you faced

Hi, I am trying to use multiple writers to write data into different partitions with Hudi 0.14. I found this Medium article, https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6, which says multi-writing is possible when writer 1 uses the in-process lock provider and runs the table services, while writer 2 only writes data with table services turned off. I tried the configs given there, and one of the writes always fails with the error below:

    23/12/19 01:02:06 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 is aborting.
    23/12/19 01:02:06 ERROR DataSourceInternalWriterHelper: Commit 20231219010014383 aborted
    23/12/19 01:02:07 WARN BaseHoodieWriteClient: Cannot find instant 20231219010014383 in the timeline, for rollback
    23/12/19 01:02:07 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 aborted.

Configs used:

    load_df_1.write.format("org.apache.hudi").
      option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx").
      option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx").
      option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx").
      option("hoodie.datasource.write.operation", "bulk_insert").
      option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
      option("hoodie.datasource.query.type", "snapshot").
      option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      option("hoodie.datasource.write.hive_style_partitioning", "true").
      option("hoodie.cleaner.policy.failed.writes", "LAZY").
      option("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL").
      option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.InProcessLockProvider").
      option("hoodie.metadata.enable", "false").
      option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
      mode("Overwrite").
      save("xxxxxxxxxxxx")

    load_df_2.write.format("org.apache.hudi").
      option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx").
      option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx").
      option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx").
      option("hoodie.datasource.write.operation", "bulk_insert").
      option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
      option("hoodie.datasource.query.type", "snapshot").
      option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      option("hoodie.datasource.write.hive_style_partitioning", "true").
      option("hoodie.cleaner.policy.failed.writes", "LAZY").
      option("hoodie.metadata.enable", "false").
      option("hoodie.table.services.enabled", "false").
      option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
      mode("Overwrite").
      save("xxxxxxxxxxxx")

Can someone help? Can this be done without locks, as the article describes, or do I need to use one of the recommended lock providers?

torvalds-dev-testbot[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev-testbot[bot] commented 10 months ago

It looks like one of the two concurrent Hudi writes into different partitions is being aborted. The log shows the data source write for commit 20231219010014383 aborting, followed by a warning that this instant cannot be found in the timeline when Hudi tries to roll it back.

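This could have several causes, including but not limited to:

  1. Concurrent writes can conflict, which aborts the data source write. Hudi records every write as an instant on the table timeline and uses that instant to roll back a failed write. If the instant for the aborted commit is not in the timeline, Hudi logs the "Cannot find instant ... in the timeline, for rollback" warning you observed.
  2. The lock setup may not cover both writers. Only the first write sets a lock provider, and InProcessLockProvider coordinates writers inside a single JVM process, so it cannot guard against a writer that runs in a separate Spark application.
  3. The configurations may be inconsistent or incomplete. For example, both writes use mode("Overwrite"), and the second write does not set hoodie.write.concurrency.mode.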
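If the two writers do run as separate Spark applications, one option to try is an external lock provider configured identically on both writers. The fragment below is a minimal sketch of such an option set using the ZooKeeper-based provider. The ZooKeeper host, port, lock key, and base path are placeholder values assumed for illustration, and whether this resolves the failure in this issue has not been verified.

```scala
// Sketch only: multi-writer options shared by BOTH writers.
// All ZooKeeper connection values below are hypothetical placeholders.
object MultiWriterLockOptions {
  val lockOptions: Map[String, String] = Map(
    // Enable optimistic concurrency control for multi-writer setups.
    "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
    // Clean up failed writes lazily, as in the original configs.
    "hoodie.cleaner.policy.failed.writes" -> "LAZY",
    // External lock provider that works across JVMs.
    "hoodie.write.lock.provider" ->
      "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url" -> "zk-host.example.com", // placeholder
    "hoodie.write.lock.zookeeper.port" -> "2181",               // placeholder
    "hoodie.write.lock.zookeeper.lock_key" -> "my_table",       // placeholder
    "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"    // placeholder
  )

  def main(args: Array[String]): Unit =
    lockOptions.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
}

// Possible usage on each writer (requires Spark and the Hudi bundle):
//   load_df_1.write.format("org.apache.hudi")
//     .options(MultiWriterLockOptions.lockOptions)
//     ... // remaining options from the issue
```

Keeping the lock options in one shared map makes it harder for the two writers to drift apart, which is the kind of mismatch visible in the configs posted above.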

Given this information, here's how I suggest we proceed:

Plan:

  1. Inspect the configurations used for the writes, specifically looking for differences between the successful and the failed writes.
  2. Compare the configurations with the recommended configurations in the Medium article.
  3. Check dependencies and package versions to make sure they are correct and compatible with each other.
  4. If the problem persists, check the environment setup and look for cluster-level issues.
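Step 1 of the plan can be sketched as a small key-by-key comparison of the two option sets. The snippet below is an illustrative helper, not part of Hudi: the object and method names are made up for this example, and the maps contain only a subset of the options quoted in the issue.

```scala
// Sketch only: compare the options of the two writers from the issue.
// DiffWriterOptions and diff are hypothetical names, not Hudi APIs.
object DiffWriterOptions {
  val writer1: Map[String, String] = Map(
    "hoodie.datasource.write.operation" -> "bulk_insert",
    "hoodie.cleaner.policy.failed.writes" -> "LAZY",
    "hoodie.write.concurrency.mode" -> "OPTIMISTIC_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider" ->
      "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    "hoodie.metadata.enable" -> "false"
  )

  val writer2: Map[String, String] = Map(
    "hoodie.datasource.write.operation" -> "bulk_insert",
    "hoodie.cleaner.policy.failed.writes" -> "LAZY",
    "hoodie.metadata.enable" -> "false",
    "hoodie.table.services.enabled" -> "false"
  )

  // Returns (key, value in a, value in b) for every key whose value differs
  // or that is set on only one side, sorted by key.
  def diff(a: Map[String, String],
           b: Map[String, String]): Seq[(String, Option[String], Option[String])] =
    (a.keySet ++ b.keySet).toSeq.sorted.flatMap { key =>
      val left = a.get(key)
      val right = b.get(key)
      if (left == right) None else Some((key, left, right))
    }

  def main(args: Array[String]): Unit =
    diff(writer1, writer2).foreach { case (key, left, right) =>
      val l = left.getOrElse("<unset>")
      val r = right.getOrElse("<unset>")
      println(key + ": writer1=" + l + ", writer2=" + r)
    }
}
```

For the options quoted in the issue, the keys that differ are hoodie.table.services.enabled, hoodie.write.concurrency.mode, and hoodie.write.lock.provider.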

I'll need additional information from you for the first step. Are there any configurations that differ between the successful and the failed writes? If so, which ones?