torvalds-dev / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Speeding up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS #59

Open torvalds-dev[bot] opened 3 weeks ago

torvalds-dev[bot] commented 3 weeks ago

Describe the problem you faced

Is there a way to speed up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS? Currently, the Multipart Upload parts seem to be written in serial.

torvalds-dev[bot] commented 3 weeks ago

Relevant Sources:

https://www.onehouse.ai/blog/apache-hudi-native-aws-integrations

https://medium.com/@devjain1299/apache-hudi-on-aws-glue-a-step-by-step-guide-503c34a9aa95

https://aws.amazon.com/blogs/big-data/part-1-query-an-apache-hudi-dataset-in-an-amazon-s3-data-lake-with-amazon-athena-part-1-read-optimized-queries/

https://www.onehouse.ai/blog/reliable-ingestion-from-aws-s3-using-apache-hudi

https://github.com/apache/hudi/issues/10358

torvalds-dev[bot] commented 3 weeks ago

Yes, it may be possible to speed up S3 Multipart Uploads when running Hudi on AWS EMR. With EMRFS, the parts of a Multipart Upload can appear to be written serially by default, which can become a write bottleneck. One approach is to override the Hudi jars on the EMR cluster with a build of your own.

To override Hudi jars in your EMR clusters and potentially enhance write speeds, follow these steps:

  1. Build Hudi Jars:
    • Clone the Hudi repository and check out version 0.7.0.
    • Build the jars with the configuration your EMR cluster needs, such as Spark 3.0.0 and Scala 2.12.
  2. Copy Jars to S3:
    • After building the jars, copy them to a temporary location in S3.
  3. Include Hudi Jars as part of the EMR Bootstrap Script:
    • Create a script that downloads the Hudi jars from the S3 location to each node in the EMR cluster.
    • Pass the script as a bootstrap action when launching the EMR cluster so the jars are installed on each node.
  4. Use the Overridden Jar in DeltaStreamer:
    • When invoking DeltaStreamer, specify the location of the overridden jar in the spark-submit command.
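The four steps above can be sketched as a small shell script. This is a minimal sketch under stated assumptions, not the exact procedure from the Hudi FAQ: the bucket name, the local jar directory, the table name, the source class, and the properties path are all hypothetical placeholders, and the Maven flags and jar file name are assumed from Hudi's build conventions for 0.7.0. The script defaults to a dry run that only prints each command.

```shell
#!/bin/sh
# Sketch of the four steps above. Values marked "hypothetical" are
# placeholders and do not come from this thread or the Hudi FAQ.
set -eu

HUDI_VERSION="0.7.0"                    # version named in step 1
SCALA_VERSION="2.12"                    # Scala version named in step 1
JAR_BUCKET="s3://my-bucket/hudi-jars"   # hypothetical S3 staging location
LOCAL_JAR_DIR="/mnt1/hudi-jars"         # hypothetical directory on each node
# Assumed bundle name, following Hudi's <bundle>_<scala>-<version>.jar pattern.
UTILITIES_JAR="hudi-utilities-bundle_${SCALA_VERSION}-${HUDI_VERSION}.jar"
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only, 0 = execute

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Step 1: clone Hudi, check out the 0.7.0 release, build for Spark 3 / Scala 2.12.
build_jars() {
  run git clone https://github.com/apache/hudi.git
  run git -C hudi checkout "release-${HUDI_VERSION}"
  run mvn -f hudi/pom.xml clean package -DskipTests -Dspark3 "-Dscala-${SCALA_VERSION}"
}

# Step 2: copy the built bundle to a temporary S3 location.
upload_jars() {
  run aws s3 cp "hudi/packaging/hudi-utilities-bundle/target/${UTILITIES_JAR}" "${JAR_BUCKET}/"
}

# Step 3: body of the bootstrap action; EMR runs it on every node at launch.
bootstrap_node() {
  run sudo mkdir -p "${LOCAL_JAR_DIR}"
  run sudo aws s3 cp "${JAR_BUCKET}/${UTILITIES_JAR}" "${LOCAL_JAR_DIR}/"
}

# Step 4: point spark-submit at the overridden jar when starting DeltaStreamer.
# Table name, source class, ordering field, and paths are hypothetical.
run_deltastreamer() {
  run spark-submit \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    "${LOCAL_JAR_DIR}/${UTILITIES_JAR}" \
    --table-type COPY_ON_WRITE \
    --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
    --source-ordering-field ts \
    --target-base-path s3://my-bucket/tables/my_table \
    --target-table my_table \
    --props s3://my-bucket/config/source.properties
}
```

Sourcing the script and calling `build_jars`, `upload_jars`, `bootstrap_node`, and `run_deltastreamer` in order prints the plan; setting `DRY_RUN=0` executes it. In practice `bootstrap_node` would live in its own file in S3 and be registered as a bootstrap action when the cluster is created, while the other functions run from a build machine and the master node respectively.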

These steps let you run your own Hudi build on EMR, which may improve write performance for S3 Multipart Uploads. (Source: Apache Hudi - FAQ Integrations)

torvalds-dev[bot] commented 3 weeks ago

Please configure the GitHub URL to enable the creation of GitHub issues.

torvalds-dev[bot] commented 3 weeks ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/torvalds-dev/hudi/issues/59
