numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Handling multiple DeltaStreamers writing to the same target table #236

Open torvalds-dev-testbot[bot] opened 3 months ago

torvalds-dev-testbot[bot] commented 3 months ago

Describe the problem you faced

I would like to understand how Hudi deals with multiple DeltaStreamers running for the same table. Does it put one in a wait state until the other finishes, or does this lead to table corruption?

torvalds-dev-testbot[bot] commented 3 months ago

Relevant Sources:

https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6

https://hudi.apache.org/docs/faq_writing_tables

https://hudi.apache.org/docs/concurrency_control

https://github.com/apache/hudi/issues/10246

torvalds-dev-testbot[bot] commented 3 months ago

Hudi handles two DeltaStreamers writing to the same target table through its multi-writer support. With that support enabled, neither writer is put in a wait state for the duration of the other's job, and conflicting writes do not corrupt the table. Instead, conflicts are resolved with Optimistic Concurrency Control (OCC).

If the two writers touch different sets of data within the same table, both can write concurrently and both commits succeed. If they update the same set of data, meaning their writes land in the same underlying file groups, one writer succeeds and the other aborts.

Multi-writer support includes a conflict resolution component that detects conflicting operations and applies a conflict resolution strategy to them. The default strategy, SimpleConcurrentFileWritesConflictResolutionStrategy, works at the file group level: writers that touch disjoint file groups all succeed, while writers that touch overlapping file groups conflict, and the strategy decides which write succeeds. Conflicts are detected per file group rather than per record to optimize performance, so two writers can conflict even when they modify different records in the same file group.

Source: https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6

Source: https://hudi.apache.org/docs/concurrency_control
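To make the file-group rule concrete, here is a minimal sketch in Python. It is not Hudi's implementation: `Write` and `can_commit` are hypothetical names that only model the overlap check described above. The option keys in `MULTI_WRITER_OPTIONS` are the ones I understand the concurrency control docs to list for OCC with a ZooKeeper lock provider; the values (host, port, lock key, base path) are placeholders, so please check them against the docs for your Hudi version.

```python
from dataclasses import dataclass

# Writer options for OCC with a ZooKeeper-based lock provider.
# Keys follow the Hudi concurrency control docs; values are placeholders.
MULTI_WRITER_OPTIONS = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",          # placeholder
    "hoodie.write.lock.zookeeper.port": "2181",            # placeholder
    "hoodie.write.lock.zookeeper.lock_key": "my_table",    # placeholder
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks" # placeholder
}


@dataclass(frozen=True)
class Write:
    """Hypothetical stand-in for one writer's pending commit."""
    writer: str
    file_groups: frozenset  # IDs of the file groups this write touches


def can_commit(already_committed, candidate):
    """Toy model of file-group-level conflict detection.

    The candidate may commit only if it shares no file group with any
    write that committed while it was in flight.
    """
    for earlier in already_committed:
        if earlier.file_groups & candidate.file_groups:
            return False  # overlapping file groups: candidate aborts
    return True


first = Write("streamer-a", frozenset({"fg-1", "fg-2"}))
disjoint = Write("streamer-b", frozenset({"fg-3"}))
overlapping = Write("streamer-c", frozenset({"fg-2", "fg-4"}))

print(can_commit([first], disjoint))     # different file groups
print(can_commit([first], overlapping))  # both touch fg-2
```

In this model `streamer-b` commits alongside `streamer-a` because their file groups are disjoint, while `streamer-c` aborts because it also touches `fg-2`. An aborted DeltaStreamer run would then need to be retried.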

torvalds-dev-testbot[bot] commented 3 months ago

Abhishek posted:

https://devs-numberlabs.slack.com/archives/C06PDCQPA3W/p1722432417358549?thread_ts=1722432570074649&cid=C06PDCQPA3W
