numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Concurrency Control Query #246

Open torvalds-dev-testbot[bot] opened 3 months ago

torvalds-dev-testbot[bot] commented 3 months ago


Describe the problem you faced

I am seeking clarification on the behavior of concurrency control in Hudi when two pipelines (a backfill pipeline and a regular fill pipeline) write to the same table location. Will this setup automatically enable concurrency control and prevent data anomalies?


torvalds-dev-testbot[bot] commented 3 months ago

Relevant Sources:

https://hudi.apache.org/docs/writing_data

https://hudi.apache.org/docs/concurrency_control

https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-710

https://apache-hudi.slack.com/archives/C4D716NPQ/p1713910144087309

torvalds-dev-testbot[bot] commented 3 months ago

Not quite: concurrency control is not enabled automatically. By default, Hudi assumes a single writer per table, so two pipelines (backfilling and filling) writing to the same table location must explicitly configure a multi-writer concurrency mode to avoid data anomalies. One option is to set hoodie.write.concurrency.mode to NON_BLOCKING_CONCURRENCY_CONTROL (supported for Flink writers on MERGE_ON_READ tables) and size each writer with the write.tasks option. Additionally, to avoid running compaction and cleaning twice against the same table, you should disable these table services on one of the pipelines via its configuration. With this setup, concurrent writes are resolved without blocking each other and without introducing inconsistencies.
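As a rough illustration, a Flink SQL definition for one of the two writers might look like the sketch below. The table name, schema, and path are placeholders, and the exact set of options should be checked against the Hudi version in use; the intent is only to show where the concurrency mode, write parallelism, and table-service switches are set.

```sql
-- Sketch of a Flink SQL sink enabling non-blocking concurrency control.
-- Table name, columns, and path are illustrative placeholders.
CREATE TABLE hudi_target (
  id   BIGINT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_target',
  'table.type' = 'MERGE_ON_READ',
  -- explicitly enable multi-writer concurrency control
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
  -- parallelism of this pipeline's write tasks
  'write.tasks' = '2',
  -- on one of the two pipelines, disable table services so that
  -- compaction and cleaning are scheduled by a single writer only
  'compaction.schedule.enabled' = 'false',
  'compaction.async.enabled' = 'false',
  'clean.async.enabled' = 'false'
);
```

The other pipeline would use the same concurrency mode but keep its table services enabled, so exactly one writer owns compaction and cleaning.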



torvalds-dev-testbot[bot] commented 3 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/246