Currently we load each Hudi instant file fully into memory before uploading it to the presigned URL, which leads to very high memory usage. We have also seen that the S3 client has an issue that causes it to use 3-4x more memory, which increases the likelihood of OOMs: https://github.com/aws/aws-sdk-java-v2/issues/4392
This PR switches the upload to streaming with zero-copy buffers, so the entire instant file never has to be loaded into memory at any point.
The approach has been tested on a 19GB test set in both AWS and GCP environments.
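
For reviewers unfamiliar with the pattern, here is a minimal sketch of a streaming upload to a presigned URL. It is not the code in this PR: the class name, method names, and buffer size are hypothetical, it uses the JDK's `HttpURLConnection` rather than the client this project uses, and it reuses one fixed-size buffer instead of true zero-copy buffers. It only illustrates why heap usage stays bounded by the buffer size instead of the file size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical illustration only; not the implementation in this PR.
final class StreamingUploadSketch {
    // Assumed buffer size: heap use per upload is bounded by this value,
    // regardless of how large the instant file is.
    static final int BUFFER_SIZE = 8 * 1024 * 1024;

    // Copies the stream chunk by chunk, reusing a single buffer.
    // Returns the total number of bytes copied.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Streams `source` to the presigned URL with an HTTP PUT and returns the status code.
    static int upload(URL presignedUrl, InputStream source, long contentLength) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) presignedUrl.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        // Without a streaming mode, HttpURLConnection buffers the whole request body in memory.
        conn.setFixedLengthStreamingMode(contentLength);
        try (OutputStream out = conn.getOutputStream()) {
            copy(source, out, BUFFER_SIZE);
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }
}
```

A caller would open the instant file as an `InputStream` and pass it to `upload` together with the file length, so the file is never materialized as a single byte array.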