[SUPPORT] Questions about Dynamo Multi-Writers

torvalds-dev / hudi

Upserts, Deletes And Incremental Processing on Big Data.

https://hudi.apache.org/

Apache License 2.0

[SUPPORT] Questions about Dynamo Multi-Writers #74

Open torvalds-dev[bot] opened 1 month ago

torvalds-dev[bot] commented 1 month ago

Describe the problem you faced

This GitHub ticket is to address the following questions regarding Dynamo Multi-Writers:

1. Is there any case in which a Spark reader that has write configurations, but is only READING a locked table, would actually need to access DynamoDB to check the lock table for anything?
2. What fundamentally is hoodie.write.lock.dynamodb.partition_key? Is it just an arbitrary name used within the lock table, unrelated to the associated Hudi table itself? Is it a column name? Does it have any application outside of the lock table? For example, if table A uses lock table lock_A_use1 with partition_key=abc and table B uses lock table lock_B_use1 with partition_key=abc, would there be any conflict?
3. Does concurrency control support tables partitioned on multiple columns? If writers W1 and W2 write to the partitions W1 = partition_col_A:data1, partition_col_B:data1 and W2 = partition_col_A:data1, partition_col_B:data2, would writer 2 succeed, wait for the lock, or fail? I would expect writer 2 to succeed, but I'm not 100% sure.

torvalds-dev[bot] commented 1 month ago

Relevant Sources:

https://hudi.apache.org/docs/concurrency_control

https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6

https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6

https://hudi.apache.org/docs/faq_writing_tables

https://apache-hudi.slack.com/archives/C4D716NPQ/p1711135382476629

torvalds-dev[bot] commented 1 month ago

1. A Spark job that only reads the table should not need to contact DynamoDB, even when write configurations are set on it. According to the Hudi concurrency control documentation, locks are taken on the write path, for example when a writer commits or when a table service is scheduled, while readers work from the committed timeline. Only jobs that actually write or run table services need access to the lock table. (Source: https://hudi.apache.org/docs/concurrency_control)
2. hoodie.write.lock.dynamodb.partition_key is the value stored in the partition key attribute of the DynamoDB lock table. It identifies the lock item that all writers to one Hudi table share, and it defaults to the Hudi table name. It is not a column of the Hudi table and has nothing to do with the table's data partitions, so it has no application outside the lock table. In the example, table A and table B use different lock tables (lock_A_use1 and lock_B_use1), so reusing partition_key=abc in both does not cause a conflict. If two Hudi tables pointed at the same lock table with the same partition key, their writers would contend for a single lock. (Source: https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6)
3. Hudi supports tables partitioned on multiple columns. The lock guards the table as a whole rather than individual partitions, and under optimistic concurrency control it is held only briefly, for example while a commit is being finalized. With W1 writing to partition_col_A:data1, partition_col_B:data1 and W2 writing to partition_col_A:data1, partition_col_B:data2, the two writers touch different partitions and therefore different files, so W2 is expected to succeed. It may wait briefly for the lock if W1 is committing at the same moment, but it should not fail. (Source: https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6)
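
To make the third point concrete, here is a deliberately simplified Python model of an optimistic concurrency check. It is not Hudi's implementation; it only assumes that two concurrent commits conflict when they touch the same file group. The PendingCommit class, the hive-style partition paths, and the file IDs (fg-1, fg-2) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingCommit:
    """Toy stand-in for an in-flight write (hypothetical, not a Hudi class)."""
    writer: str
    # Set of (partition_path, file_id) pairs this commit modifies.
    file_groups: frozenset


def conflicts(a: PendingCommit, b: PendingCommit) -> bool:
    """Two commits conflict only if they modify at least one common file group."""
    return bool(a.file_groups & b.file_groups)


# W1 and W2 from the question: same partition_col_A, different partition_col_B.
w1 = PendingCommit(
    "W1", frozenset({("partition_col_A=data1/partition_col_B=data1", "fg-1")})
)
w2 = PendingCommit(
    "W2", frozenset({("partition_col_A=data1/partition_col_B=data2", "fg-2")})
)
# A third, hypothetical writer that updates the same file group as W1.
w3 = PendingCommit(
    "W3", frozenset({("partition_col_A=data1/partition_col_B=data1", "fg-1")})
)

print(conflicts(w1, w2))  # False: different partitions, different files
print(conflicts(w1, w3))  # True: both modify the same file group
```

In this model the number of partition columns does not matter; only the overlap of touched files does, which is why W2 is expected to commit successfully.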

torvalds-dev[bot] commented 1 month ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/torvalds-dev/hudi/issues/74

torvalds-dev[bot] commented 1 month ago

1. No. A Spark job that has write configurations but only reads the table does not need to access DynamoDB. Locks matter only on the write path, where they are acquired for a short duration at specific phases such as committing writes or scheduling table services. Source: https://hudi.apache.org/docs/concurrency_control
2. hoodie.write.lock.dynamodb.partition_key is an arbitrary name used within the lock table. It is not a column name and is unrelated to the data in the associated Hudi table; by default it is the Hudi table name. It distinguishes one table's lock entry from another's and has no application outside the lock table. Two Hudi tables with separate lock tables can use the same partition key without conflict. If they shared one lock table, each should use its own key so that their writers do not block each other. Source: https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6
3. For tables partitioned on multiple columns, when writers W1 and W2 write to different partitions (W1 = partition_col_A:data1, partition_col_B:data1 and W2 = partition_col_A:data1, partition_col_B:data2), W2 succeeds because the two commits do not touch the same files. Both writers still take the lock briefly when committing, so W2 may wait a moment, but it does not fail. Source: https://medium.com/@simpsons/multi-writer-support-with-apache-hudi-e1b75dca29e6
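
For the partition_key question, the following Python sketch shows how the writer options fit together. The configuration keys and the lock provider class name are taken from the Hudi configuration reference as I understand it; the helper functions, the region value us-east-1, and the idea of comparing a "lock identity" tuple are assumptions made for illustration, not part of Hudi.

```python
# Lock provider class shipped in Hudi's AWS module.
DYNAMODB_LOCK_PROVIDER = (
    "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider"
)


def dynamodb_lock_options(lock_table: str, partition_key: str, region: str) -> dict:
    """Build writer options for the DynamoDB lock provider (hypothetical helper)."""
    return {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider": DYNAMODB_LOCK_PROVIDER,
        "hoodie.write.lock.dynamodb.table": lock_table,
        "hoodie.write.lock.dynamodb.partition_key": partition_key,
        "hoodie.write.lock.dynamodb.region": region,
    }


def lock_identity(options: dict) -> tuple:
    """Assumed model: a lock is identified by region, lock table, and partition key."""
    return (
        options["hoodie.write.lock.dynamodb.region"],
        options["hoodie.write.lock.dynamodb.table"],
        options["hoodie.write.lock.dynamodb.partition_key"],
    )


def share_a_lock(a: dict, b: dict) -> bool:
    """Writers contend with each other only when they resolve to the same lock."""
    return lock_identity(a) == lock_identity(b)


# The scenario from the question: two lock tables, same partition_key value.
table_a = dynamodb_lock_options("lock_A_use1", "abc", "us-east-1")
table_b = dynamodb_lock_options("lock_B_use1", "abc", "us-east-1")
# A second writer to table A must use exactly the same settings as the first.
table_a_writer_2 = dynamodb_lock_options("lock_A_use1", "abc", "us-east-1")

print(share_a_lock(table_a, table_b))           # False: separate lock tables
print(share_a_lock(table_a, table_a_writer_2))  # True: same table and key
```

The dictionary returned by dynamodb_lock_options could be passed as options to a Spark write to a Hudi table. A job that only reads does not need any of these settings.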
