Open torvalds-dev[bot] opened 3 weeks ago
Yes, S3 Multipart Uploads with Hudi on AWS EMR can be sped up. With EMRFS, Multipart Upload parts may be written serially by default, which can become a write bottleneck. One option is to override the Hudi jars on your EMR clusters.
To override Hudi jars in your EMR clusters and potentially enhance write speeds, follow these steps:
Build Hudi Jars:
Copy Jars to S3:
Include Hudi Jars as part of the EMR Bootstrap Script:
Use the Overridden Jar in Deltastreamer:
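The four steps above can be sketched as a small Python helper that only assembles the shell commands and does not run them. This is a rough illustration under stated assumptions, not the procedure from the Hudi FAQ: the function name, bucket name, jar file name, and install directory are all hypothetical placeholders, the Maven invocation omits any Spark or Scala profile flags your build may need, and the DeltaStreamer class name may differ between Hudi releases.

```python
import shlex


def hudi_override_commands(bucket, jar_name, install_dir):
    """Assemble (but do not execute) one command per step.

    All three arguments are hypothetical placeholders supplied by the caller.
    """
    # Assumed location of the bundle jar after a local Hudi build.
    local_jar = f"packaging/hudi-utilities-bundle/target/{jar_name}"
    s3_jar = f"s3://{bucket}/{jar_name}"
    node_jar = f"{install_dir}/{jar_name}"
    return {
        # Step 1: build the Hudi jars from a source checkout.
        "build": ["mvn", "clean", "package", "-DskipTests"],
        # Step 2: copy the built jar to S3.
        "copy_to_s3": ["aws", "s3", "cp", local_jar, s3_jar],
        # Step 3: line for the EMR bootstrap script, pulling the jar onto each node.
        "bootstrap": ["sudo", "aws", "s3", "cp", s3_jar, node_jar],
        # Step 4: point spark-submit at the overridden jar.
        # The remaining DeltaStreamer arguments are omitted here.
        "deltastreamer": [
            "spark-submit",
            "--jars", node_jar,
            "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
            node_jar,
        ],
    }


if __name__ == "__main__":
    commands = hudi_override_commands(
        bucket="my-bucket",                      # hypothetical bucket
        jar_name="hudi-utilities-bundle.jar",    # hypothetical jar name
        install_dir="/mnt1/hudi-jars",           # hypothetical node directory
    )
    for step, argv in commands.items():
        print(f"{step}: {shlex.join(argv)}")
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; review each line and adjust it to your cluster before using it.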
Following these steps may improve S3 Multipart Upload write performance for Hudi on AWS EMR. (Source: Apache Hudi - FAQ Integrations)
Please configure the GitHub URL to enable the creation of GitHub issues.
Ticket created successfully. Here is the link to the GitHub issue: https://github.com/torvalds-dev/hudi/issues/59
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
Is there a way to speed up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS? Currently, the Multipart Upload parts seem to be written in serial.
To Reproduce
Steps to reproduce the behavior:
1. 2. 3. 4.
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
Hudi version :
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) :
Running on Docker? (yes/no) :
Additional context
Add any other context about the problem here.
Stacktrace
Add the stacktrace of the error.