torvalds-dev / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Questions regarding Dynamo Multi-Writers #70

Open torvalds-dev[bot] opened 1 month ago

torvalds-dev[bot] commented 1 month ago

Describe the problem you faced

Three questions regarding Dynamo Multi-Writers:

  1. Is there any case in which a Spark reader that has write configurations set, but is only READING a locked table, would actually need to access DynamoDB to check the lock table for anything?
  2. What fundamentally is hoodie.write.lock.dynamodb.partition_key? Is it just an arbitrary name to use within the lock table which is unrelated to the associated table itself? Is it a column name? Does it have any application outside of the lock table (if I have table A and table B with lock_A_use1 partition_key=abc and lock_B_use1 partition_key=abc would there be any conflict?)
  3. Does concurrency support tables partitioned on multiple columns?
    • If writers W1 and W2 write to the partitions W1 = (partition_col_A: data1, partition_col_B: data1) and W2 = (partition_col_A: data1, partition_col_B: data2), would writer 2 succeed, wait for the lock, or fail? I would expect writer 2 to succeed, but I'm not 100% sure.

torvalds-dev[bot] commented 1 month ago

Relevant Sources:

https://hudi.apache.org/docs/concurrency_control

https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6

https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6

https://hudi.apache.org/docs/faq_writing_tables

https://apache-hudi.slack.com/archives/C4D716NPQ/p1711135382476629

torvalds-dev[bot] commented 1 month ago
  1. A Spark job that only reads the table should not need to access DynamoDB, even if write configurations are set on it. Hudi serves readers through snapshot isolation over the timeline of completed commits, so reads do not acquire the lock. The lock provider is used by writers and table services when they commit, to resolve conflicts between concurrent updates to the same set of data in the Hudi table (sources: https://hudi.apache.org/docs/concurrency_control, http://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6).

  2. hoodie.write.lock.dynamodb.partition_key is the value of the partition key under which the lock item is stored in the DynamoDB lock table. It is not a column name and has no meaning outside the lock table; by default it is the Hudi table name. All writers to the same Hudi table should use the same key so that they contend for the same lock. Two Hudi tables share a lock only if they use both the same DynamoDB lock table and the same partition key, so lock_A_use1 and lock_B_use1 with partition_key=abc should not conflict (source: http://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6).

  3. Hudi supports concurrent writes to tables partitioned on multiple columns. Under optimistic concurrency control, conflicts are detected from the files each writer changed, not from individual partition columns. In the scenario described, W1 writes to partition_col_A:data1, partition_col_B:data1 and W2 writes to partition_col_A:data1, partition_col_B:data2, so they write to different partition paths and therefore different files, and writer 2 should succeed. The lock is taken at the table level and held only briefly around the commit, so writer 2 may wait for it for a short time but should not fail because of it (source: http://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6).
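
As a rough illustration of points 2 and 3, here is a minimal Python sketch. The option keys are Hudi writer settings, but everything else is an assumption made for illustration: the helper names (dynamodb_lock_options, lock_identity, writes_overlap), the region, the file ids, and the hive-style partition paths are hypothetical. It assumes a lock is identified by the pair (DynamoDB lock table, partition key value), and writes_overlap is only a toy model of file-level conflict checking, not Hudi's actual implementation.

```python
# Sketch only: the option keys are Hudi settings; every value, table name,
# and helper function below is hypothetical.

def dynamodb_lock_options(lock_table, partition_key, region="us-east-1"):
    """Writer options for optimistic concurrency with the DynamoDB lock provider."""
    return {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        "hoodie.write.lock.dynamodb.table": lock_table,
        "hoodie.write.lock.dynamodb.partition_key": partition_key,
        "hoodie.write.lock.dynamodb.region": region,
    }


def lock_identity(options):
    """Assumption: a lock is identified by (lock table, partition key value)."""
    return (
        options["hoodie.write.lock.dynamodb.table"],
        options["hoodie.write.lock.dynamodb.partition_key"],
    )


def writes_overlap(file_groups_a, file_groups_b):
    """Toy conflict check: do two writers touch any common (partition path, file id)?"""
    return bool(set(file_groups_a) & set(file_groups_b))


# Question 2: same partition_key, different lock tables.
table_a = dynamodb_lock_options("lock_A_use1", "abc")
table_b = dynamodb_lock_options("lock_B_use1", "abc")
same_lock = lock_identity(table_a) == lock_identity(table_b)

# Question 3: same value for partition_col_A, different value for partition_col_B.
w1 = [("partition_col_A=data1/partition_col_B=data1", "file-1")]
w2 = [("partition_col_A=data1/partition_col_B=data2", "file-2")]
conflict = writes_overlap(w1, w2)

print(same_lock, conflict)  # prints: False False
```

In a real job these options would be passed to the Hudi writer alongside the usual table settings; a job that only reads would not need any of the lock settings.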

torvalds-dev[bot] commented 1 month ago

Torvalds.Dev posted:

https://torvaldsdevworkspace.slack.com/archives/C077KAJCZUP/p1722450786615829?thread_ts=1722450877426689&cid=C077KAJCZUP

torvalds-dev[bot] commented 1 month ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/torvalds-dev/hudi/issues/70