torvalds-dev / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Steps for upgrading from Hudi 0.6.0 to 0.12.1 and expectations post-upgrade #72

Open torvalds-dev[bot] opened 1 month ago

torvalds-dev[bot] commented 1 month ago

Describe the problem you faced

Our organization is migrating from Hudi 0.6.0 to Hudi 0.12.1 and updating the required Spark and EMR versions. The existing datasets, which total hundreds of terabytes on S3, were written with Hudi 0.6.0. We are unsure whether 0.12.1 can be used directly against these tables and would like the steps for the upgrade process. We are particularly concerned about having to rebuild the tables, since they contain billions of records.

Expected behavior

Provide steps for upgrading from Hudi 0.6.0 to 0.12.1 and clarify if rebuilding the tables is necessary when dealing with tables holding billions of records.

Additional context

Should we expect the following improvements after the upgrade:

torvalds-dev[bot] commented 1 month ago

Relevant Sources:

https://hudi.apache.org/blog/2020/08/20/efficient-migration-of-large-parquet-tables/#spark-data-source

https://hudi.apache.org/docs/deployment

https://www.onehouse.ai/blog/how-apache-hudi-simplifies-mpp-data-warehouse-migrations

https://apache-hudi.slack.com/archives/C4D716NPQ/p1713562820086629

torvalds-dev[bot] commented 1 month ago

Hello! 👋 Please wait while I generate the follow-up questions; this usually takes about 30 seconds.

torvalds-dev[bot] commented 1 month ago

Follow-up questions:

  1. Could you describe the Hudi version and the rest of the tech stack in your environment? If you are facing an error, please provide the full stack trace and describe the operations that led to it. Spark UI screenshots are also useful for triage.
  2. Have you configured the metastore URIs for syncing Hudi tables to Hive?
  3. Can you share your Hudi write configs, along with the versions of Hudi and the compute engine you are using?
  4. Are you currently seeing duplicate records in your dataset? If so, could you provide the query used to detect them?
  5. Can you specify the indexing scheme, table type, table services, number of parallel writers, catalog, Hudi version, metadata table status, enabled indexes, Hudi timeline, and partitioning scheme you use on Hudi 0.6.0 and plan to carry over to 0.12.1?
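
For question 4, one way to phrase a duplicate check is to count rows per record key within each partition. The snippet below is a minimal sketch in plain Python, not a query taken from this thread: it assumes the rows have already been read into dictionaries and that the Hudi metadata columns `_hoodie_record_key` and `_hoodie_partition_path` are present. The function name `find_duplicate_keys` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hudi metadata columns that identify a record within a table.
RECORD_KEY = "_hoodie_record_key"
PARTITION_PATH = "_hoodie_partition_path"


def find_duplicate_keys(rows):
    """Return {(partition_path, record_key): count} for keys seen more than once.

    `rows` is any iterable of dict-like rows that carry the two Hudi
    metadata columns named above (an assumption of this sketch).
    """
    counts = Counter((row[PARTITION_PATH], row[RECORD_KEY]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample data: "id-1" appears twice in the same partition.
sample_rows = [
    {PARTITION_PATH: "2024/01/01", RECORD_KEY: "id-1", "value": 10},
    {PARTITION_PATH: "2024/01/01", RECORD_KEY: "id-2", "value": 20},
    {PARTITION_PATH: "2024/01/01", RECORD_KEY: "id-1", "value": 11},
    {PARTITION_PATH: "2024/01/02", RECORD_KEY: "id-1", "value": 12},
]

duplicates = find_duplicate_keys(sample_rows)
print(duplicates)
```

On a table of the size described in this issue, the same grouping would normally be run in the query engine (a GROUP BY on the two columns with HAVING COUNT(*) > 1) rather than in driver-side Python; the sketch only shows the shape of the check.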

torvalds-dev[bot] commented 1 month ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/torvalds-dev/hudi/issues/72

torvalds-dev[bot] commented 1 month ago

Thank you for creating the ticket.

torvalds-dev[bot] commented 1 month ago

Follow-up questions:

  1. Could you describe the nature of your dataset and whether you perform inserts or updates frequently?
  2. Have you considered upgrading selectively, for example converting only the newer partitions to Hudi, as opposed to a full conversion to Hudi?
  3. Are there specific concerns or challenges in migrating from Hudi 0.6.0 to 0.12.1 that you would like to highlight?