numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] New User Table Design Query #228

Open torvalds-dev-testbot[bot] opened 3 months ago

torvalds-dev-testbot[bot] commented 3 months ago

**Describe the problem you faced**

I am a new user to Hudi and Parquet, and I have a table design question. I have structured my table in the following way:


"name": "foobar",
"metadata" : {
"key1" : "meta-val1",
"key2" : "meta-val2",
... 
"keyn" : "meta-valn"
},
"attributes" : {
"attr1" : ["val1"],
"attr2" : ["val20", "val21", "val22", "val23"],
... 
.. 
"attrn" : ["valno", "valn1", ..., "valnm"]
}
}```
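
In Spark terms, a natural reading of this shape is a string `name`, a string-to-string map for `metadata`, and a map from attribute name to an array of strings for `attributes`. The sketch below is only one possible schema (an assumption, since the JSON alone does not pin the types down); if the metadata keys and attribute names are fixed and known up front, a struct with named fields works equally well and keeps the dotted syntax used in the queries below.

```python
from pyspark.sql.types import (
    ArrayType, MapType, StringType, StructField, StructType,
)

# One possible Spark schema for the record shape above; whether metadata and
# attributes are modelled as maps (as here) or as structs with named fields
# depends on whether the key/attribute names are known in advance.
object_schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("metadata", MapType(StringType(), StringType())),
    StructField("attributes", MapType(StringType(), ArrayType(StringType()))),
])
```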

I need to run the following queries (a Spark SQL sketch of these follows the list):
1. Find a record by name (select * from objects where name='foobar')
2. Find records by metadata (select * from objects where metadata.key1='something' and/or metadata.key2='something2' and/or ... metadata.keyn='somethingn')
3. Find records by attributes (select * from objects where attributes.attr1='val1' and/or attributes.attr2='something2' and/or ... attributes.attrn='somethingn')
4. A flavor of #3 with different operators: !=, starts with, contains, ends with, regex
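
Here is that sketch, assuming an active SparkSession `spark`, the Hudi table registered as a temporary view named `objects`, and the map-based typing from the schema sketch above (with struct columns, the dotted `metadata.key1` form from the list works directly instead of the bracket lookup):

```python
# 1. Find a record by name
spark.sql("SELECT * FROM objects WHERE name = 'foobar'").show()

# 2. Find records by metadata values (combine with AND/OR as needed)
spark.sql("""
    SELECT * FROM objects
    WHERE metadata['key1'] = 'something'
      AND metadata['key2'] = 'something2'
""").show()

# 3. Each attribute holds an array, so equality becomes array membership
spark.sql("""
    SELECT * FROM objects
    WHERE array_contains(attributes['attr1'], 'val1')
       OR array_contains(attributes['attr2'], 'something2')
""").show()

# 4. Other operators over array elements via the exists() higher-order
#    function; contains / ends-with follow the same LIKE pattern ('%x%', '%x')
spark.sql("""
    SELECT * FROM objects
    WHERE exists(attributes['attr1'], v -> v != 'val1')
      AND exists(attributes['attr2'], v -> v LIKE 'val%')        -- starts with
      AND exists(attributes['attr2'], v -> v RLIKE 'val2[0-9]')  -- regex
""").show()
```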

torvalds-dev-testbot[bot] commented 3 months ago

Relevant Sources:

https://aws.amazon.com/blogs/big-data/part-1-get-started-with-apache-hudi-using-aws-glue-by-implementing-key-design-concepts/

https://hudi.apache.org/docs/faq_general

https://hudi.apache.org/docs/metadata

https://datacouch.medium.com/getting-started-with-apache-hudi-711b89c107aa

https://apache-hudi.slack.com/archives/C4D716NPQ/p1711852787193459

torvalds-dev-testbot[bot] commented 3 months ago

Hello! 👋 Please wait while I search the knowledge base for an answer; this usually takes about 30 seconds.

torvalds-dev-testbot[bot] commented 3 months ago

The structure you've provided is workable for an Apache Hudi table that serves the queries you've outlined. The two design decisions to make up front are the record key and the partition fields. The record key determines how Hudi matches upserts and deletes to existing records, so it should be a field that uniquely identifies each record (name, if it is unique). Partition configuration determines how much data each read and write has to touch, so choose partition fields that line up with your most common filters; for very large (petabyte-scale) tables this is the biggest lever on read and write throughput. With those two choices settled, the queries above become ordinary predicate queries over the nested columns.
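
As a concrete illustration of those two choices, here is a minimal PySpark write sketch. It assumes `name` is unique enough to act as the record key and that `df` holds records in the shape above; the `ingest_date` partition/precombine column and the target path are purely illustrative placeholders, so substitute fields that actually exist in your data (or drop partitioning for a small table).

```python
# Minimal sketch of a Hudi upsert with an explicit record key and partition
# field. "ingest_date" is a hypothetical column used only to illustrate
# partitioning; it is not part of the schema in this issue.
hudi_options = {
    "hoodie.table.name": "objects",
    "hoodie.datasource.write.recordkey.field": "name",          # how upserts/deletes locate a record
    "hoodie.datasource.write.partitionpath.field": "ingest_date",
    "hoodie.datasource.write.precombine.field": "ingest_date",  # breaks ties between versions of a key
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://your-bucket/hudi/objects"))
```

Reading the table back for the queries above is then just `spark.read.format("hudi").load("s3://your-bucket/hudi/objects").createOrReplaceTempView("objects")`.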

torvalds-dev-testbot[bot] commented 3 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/228

torvalds-dev-testbot[bot] commented 3 months ago

Abhishek posted:

https://devs-numberlabs.slack.com/archives/C06PDCQPA3W/p1722326194411479?thread_ts=1722326569817979&cid=C06PDCQPA3W
