numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hi, I am trying a use case to use multi writer to write data into different partitions with version 0 #61

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago

Describe the problem you faced

Hi, I am trying a use case that uses multiple writers to write data into different partitions with version 0.14. I found this Medium article https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6 which says I can do multi-writing with writer 1 holding an in-process lock (which allows it to perform table services) and writer 2 just writing the data with table services turned off. I tried with the configs given, and one of the writes always fails with the error below:

23/12/19 01:02:06 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 is aborting.
23/12/19 01:02:06 ERROR DataSourceInternalWriterHelper: Commit 20231219010014383 aborted
23/12/19 01:02:07 WARN BaseHoodieWriteClient: Cannot find instant 20231219010014383 in the timeline, for rollback
23/12/19 01:02:07 ERROR AppendDataExec: Data source write support org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite@6db6a766 aborted.

Configs Used:

Writer 1:

load_df_1.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.query.type", "snapshot").
  option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL").
  option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.InProcessLockProvider").
  option("hoodie.metadata.enable", "false").
  option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
  mode("Overwrite").
  save("xxxxxxxxxxxx")

Writer 2:

load_df_2.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.partitionpath.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.precombine.field", "xxxxxxxxxxxx").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.query.type", "snapshot").
  option("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.metadata.enable", "false").
  option("hoodie.table.services.enabled", "false").
  option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
  mode("Overwrite").
  save("xxxxxxxxxxxx")

Can someone help? Can this be done without using locks, as per the article, or should I definitely use one of the recommended lock providers?

torvalds-dev-testbot[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev-testbot[bot] commented 10 months ago

This appears to be a problem with Apache Hudi's multi-writer capability, where one of the writes fails with the errors reported above. It may be a configuration issue or a concurrency problem.

Plan:

  1. Analyze the error logs.
  2. Check the configuration settings.
  3. Suggest possible fixes or workarounds.

Step 1: Analyze the error logs.

From the error logs, we can see that the HoodieDataSourceInternalBatchWrite commit is aborting. The logs also mention "Cannot find instant 20231219010014383 in the timeline, for rollback", which means the instant it is trying to roll back no longer exists (or never existed) in the timeline.

This might suggest a concurrency issue, where two write operations start around the same time and conflict with each other.
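As a first check, it can help to look at the table's timeline and see whether the failing instant ever made it there. Below is a minimal sketch, assuming the table base path (the same placeholder passed to save(...) above) is reachable from the Spark driver:

import org.apache.hadoop.fs.Path

// Base path of the Hudi table: same placeholder as in save("xxxxxxxxxxxx") above.
val basePath = "xxxxxxxxxxxx"
val fs = new Path(basePath).getFileSystem(spark.sparkContext.hadoopConfiguration)

// List the .hoodie timeline and keep only files belonging to the aborted instant.
// If no .requested / .inflight / completed file shows up, the instant is no longer
// (or was never) present in the timeline, which matches the rollback warning.
fs.listStatus(new Path(basePath, ".hoodie"))
  .map(_.getPath.getName)
  .filter(_.startsWith("20231219010014383"))
  .foreach(println)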

Step 2: Check the configuration settings.

The configuration is set up for multiple writers with different settings, and one writer's settings may conflict with the other's. In particular, writer 1 sets "hoodie.write.concurrency.mode" to OPTIMISTIC_CONCURRENCY_CONTROL and "hoodie.write.lock.provider" to InProcessLockProvider, while writer 2 sets neither, so the two writers are not using the same concurrency-control settings.

"HoodieLockConfig.newBuilder().withLockProvider(lockProvider)" and "'hoodie.write.concurrency.mode','OPTIMISTIC_CONCURRENCY_CONTROL"

These settings control how multiple concurrent writes are handled. If the system optimistically assumes that conflicts will not happen, it can run into problems when a conflict does occur.
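As an illustration only (not a confirmed fix), one option is to give both writers the same concurrency-control options, along the lines of the sketch below; the Map-based form is just a convenience, and the values mirror the ones already shown for writer 1:

// Shared concurrency options applied to BOTH writers (sketch only).
val concurrencyOpts = Map(
  "hoodie.cleaner.policy.failed.writes" -> "LAZY",
  "hoodie.write.concurrency.mode" -> "OPTIMISTIC_CONCURRENCY_CONTROL",
  "hoodie.write.lock.provider" -> "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
  // Keep hoodie.metadata.enable consistent across writers as well.
  "hoodie.metadata.enable" -> "false"
)

load_df_1.write.format("org.apache.hudi").
  options(concurrencyOpts).
  // ...remaining writer 1 options as shown above...
  mode("Overwrite").
  save("xxxxxxxxxxxx")

load_df_2.write.format("org.apache.hudi").
  options(concurrencyOpts).
  // ...remaining writer 2 options as shown above...
  mode("Overwrite").
  save("xxxxxxxxxxxx")

Note that InProcessLockProvider coordinates writers within a single JVM; if the two writers run as separate Spark applications, an external lock provider (see the suggestions below) is generally needed.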

It would also be worth checking whether enabling or disabling the metadata table with the "hoodie.metadata.enable" option affects anything.

Step 3: Suggest possible fixes or workarounds.

Since it looks like a concurrency issue, you might want to try changing the lock provider and write concurrency mode.

Additionally, here are some specific suggestions:

  1. Try enabling/disabling the metadata option, as this can sometimes cause issues.
  2. Try a different lock provider or concurrency mode (see the sketch after this list).
  3. Check that the save location of your DataFrame is correct. You have given a placeholder in the script; make sure it refers to a valid location.
  4. Finally, if the concurrency issue still persists, it might be worth serializing the writes or making sure the writes are substantially different (for example, touching different partitions) so that they do not conflict.
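If the two writers run as separate Spark applications, an in-process lock cannot coordinate them, so an external lock provider is one thing to try. Below is a hedged sketch using Hudi's ZookeeperBasedLockProvider; the lock-related keys are the standard Hudi lock configs, but the ZooKeeper host, port, and paths are placeholders, not values taken from this issue:

// Sketch only: external (ZooKeeper-based) lock provider for writers running
// in separate applications. Host, port, and paths below are placeholders.
load_df_1.write.format("org.apache.hudi").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL").
  option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "zk-host").
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", "my_table_lock").
  option("hoodie.write.lock.zookeeper.base_path", "/hudi_locks").
  // ...remaining writer options as shown above...
  option(HoodieWriteConfig.TABLE_NAME, "xxxxxxxxxxxx").
  mode("Overwrite").
  save("xxxxxxxxxxxx")

The same lock options would go on the second writer so that both writers coordinate through the same lock.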

Remember to always have a backup of your data before making changes.