[SUPPORT] Steps for upgrading from Hudi 0.6.0 to 0.12.1 and expectations post-upgrade

torvalds-dev[bot] commented 1 month ago

Describe the problem you faced

Our organization is migrating from Hudi 0.6.0 to Hudi 0.12.1 and updating Spark and EMR to the versions it requires. Our existing datasets, which total hundreds of terabytes on S3, were written with Hudi 0.6.0. We are unsure whether 0.12.1 can be used directly on these tables and would like the steps for the upgrade process. We are particularly concerned about having to rebuild our tables, as they contain billions of records.

Expected behavior

Provide steps for upgrading from Hudi 0.6.0 to 0.12.1, and clarify whether tables holding billions of records must be rebuilt.

Additional context

Should we expect the following improvements after the upgrade?

Faster upserts
Columns add/modify (schema evolution)
Clustering
Possible solutions for storing the history of updates performed on records

Environment Description

Hudi version: 0.6.0
Spark version:
Hive version:
Hadoop version:
Storage (HDFS/S3/GCS..): S3
Running on Docker? (yes/no): No
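
Several of the fields above are blank, and table-level details can often be recovered from the table itself. As a minimal sketch, assuming the table's `.hoodie/hoodie.properties` file has been copied locally as text, the snippet below parses it and reports the table name, type, and version. The helper names are hypothetical, the sample contents are illustrative rather than taken from this table, and `hoodie.table.version` may be absent on tables written by older releases.

```python
def parse_hoodie_properties(text):
    """Parse Java-style key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def summarize_table(props):
    """Pick out the properties most relevant when planning an upgrade."""
    return {
        "name": props.get("hoodie.table.name"),
        "type": props.get("hoodie.table.type"),
        # Assumption: a missing version key means the table predates versioning.
        "version": props.get("hoodie.table.version", "not recorded"),
    }


# Illustrative sample only; not the reporter's actual table properties.
SAMPLE = """\
# illustrative hoodie.properties contents
hoodie.table.name=example_table
hoodie.table.type=COPY_ON_WRITE
hoodie.table.version=1
"""

summary = summarize_table(parse_hoodie_properties(SAMPLE))
print(summary)
```

Recording this summary before and after the first write with the new release gives a simple way to see whether the table version changed.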

torvalds-dev[bot] commented 1 month ago

Relevant Sources:

https://hudi.apache.org/blog/2020/08/20/efficient-migration-of-large-parquet-tables/#spark-data-source

https://hudi.apache.org/docs/deployment

https://www.onehouse.ai/blog/how-apache-hudi-simplifies-mpp-data-warehouse-migrations

https://apache-hudi.slack.com/archives/C4D716NPQ/p1713562820086629

torvalds-dev[bot] commented 1 month ago

Follow up questions:

Could you describe the Hudi version and the rest of the tech stack in your environment? If you are facing an error, please provide a full stack trace and describe the operations that led to it. Spark UI screenshots are also useful for triage.
Have you configured the metastore URIs for syncing Hudi tables to Hive?
Can you please share the Hudi write configs and versions of Hudi and the compute engine you are using?
In your dataset, are you currently facing any issues with duplicate records? If yes, could you provide the query used to detect duplicates?
Can you specify the indexing scheme, table type, table services, number of writers used in parallel, catalog, Hudi version, metadata table status, enabled indexes, Hudi timeline, and partitioning scheme you are utilizing in Hudi 0.6.0 and planning to update to 0.12.1?
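
For the duplicate-record question, here is a minimal sketch of one way to check. It assumes rows have already been read from the table into plain Python dicts that include Hudi's `_hoodie_record_key` and `_hoodie_partition_path` metadata columns. The function name and sample rows are hypothetical; on a table of this size the same grouping would be run in Spark rather than locally.

```python
from collections import Counter


def find_duplicate_keys(rows):
    """Return {(partition_path, record_key): count} for keys seen more than once."""
    counts = Counter(
        (row["_hoodie_partition_path"], row["_hoodie_record_key"]) for row in rows
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows standing in for a scan of the table.
rows = [
    {"_hoodie_partition_path": "2020/01/01", "_hoodie_record_key": "id-1"},
    {"_hoodie_partition_path": "2020/01/01", "_hoodie_record_key": "id-1"},
    {"_hoodie_partition_path": "2020/01/01", "_hoodie_record_key": "id-2"},
    {"_hoodie_partition_path": "2020/01/02", "_hoodie_record_key": "id-1"},
]

duplicates = find_duplicate_keys(rows)
print(duplicates)
```

Keys are grouped per partition here on the assumption that uniqueness is enforced within a partition; with a global index, the partition path would be dropped from the grouping key.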

torvalds-dev[bot] commented 1 month ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/torvalds-dev/hudi/issues/72

torvalds-dev[bot] commented 1 month ago

Thank you for creating the ticket.

torvalds-dev[bot] commented 1 month ago

Follow up questions:

Could you describe the nature of your dataset and whether you perform inserts or updates frequently?
Have you considered the options for upgrading selectively, such as converting newer partitions to Hudi or full conversion to Hudi?
Do you have any specific concerns or challenges in migrating from Hudi 0.6.0 to 0.12.1 that you would like to highlight?

Sources:

https://hudi.apache.org/blog/2020/08/20/efficient-migration-of-large-parquet-tables/#spark-data-source
https://hudi.apache.org/docs/deployment
https://medium.com/@devjain1299/apache-hudi-on-aws-glue-a-step-by-step-guide-503c34a9aa95
https://www.onehouse.ai/blog/how-apache-hudi-simplifies-mpp-data-warehouse-migrations
https://hudi.apache.org/docs/cli
