Open torvalds-dev-testbot[bot] opened 3 months ago
Not automatically. Two pipelines (backfilling and filling) writing concurrently to the same table location do not turn on concurrency control by themselves: Hudi defaults to single-writer mode, so a multi-writer mode has to be configured explicitly. To do so, set hoodie.write.concurrency.mode to NON_BLOCKING_CONCURRENCY_CONTROL in both pipelines and set the number of write tasks with the write.tasks option. With that mode enabled, Hudi manages conflicts between the concurrent writes. Additionally, to avoid running the compaction and cleaning services twice for the same table, disable them in one of the pipelines through its configuration options.
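The configuration described above can be sketched as two option sets, one per pipeline. This is a minimal illustration, not a tested Hudi setup: the table path and the write.tasks value are hypothetical, and the three option names used to switch off compaction and cleaning are assumed to be the Flink writer options; check them against the configuration reference for your Hudi version.

```python
# Options shared by both writers. The two hoodie/write keys come from the
# answer above; the path and the task count are hypothetical placeholders.
COMMON = {
    "connector": "hudi",
    "path": "/tmp/hudi/t1",  # hypothetical table location
    "hoodie.write.concurrency.mode": "NON_BLOCKING_CONCURRENCY_CONTROL",
    "write.tasks": "2",  # illustrative value
}

# ASSUMPTION: these Flink option names disable the table services.
# Verify them against the Hudi configuration reference before use.
TABLE_SERVICES_OFF = {
    "compaction.schedule.enabled": "false",
    "compaction.async.enabled": "false",
    "clean.async.enabled": "false",
}


def with_clause(options):
    """Render an options dict as a Flink SQL WITH (...) clause."""
    lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return "WITH (\n" + ",\n".join(lines) + "\n)"


# The filling pipeline keeps compaction and cleaning enabled.
filling = dict(COMMON)

# The backfilling pipeline writes to the same path with services disabled.
backfilling = {**COMMON, **TABLE_SERVICES_OFF}

print(with_clause(filling))
print(with_clause(backfilling))
```

Keeping the shared options in one dictionary makes it harder for the two pipelines to drift apart on the table path or the concurrency mode; only the table-service switches differ.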
Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/246
Describe the problem you faced
I am seeking clarification on how concurrency control behaves in Hudi when two pipelines (backfilling and filling) write to the same table location. Will this setup automatically enable concurrency control to prevent data anomalies?