onehouseinc / LakeView

Monitoring and insights on your data lakehouse tables
Apache License 2.0

[ENG-12847] Optimise metadata extractor to prevent OOM #94

Closed: sampan-s-nayak closed this 3 weeks ago

sampan-s-nayak commented 4 weeks ago

Currently, we load each Hudi instant file into memory in full before uploading it to the presigned URL, which leads to very high memory usage. We have also seen that the S3 client has an issue that causes it to use 3-4x more memory, which increases the likelihood of OOMs: https://github.com/aws/aws-sdk-java-v2/issues/4392

In this PR, the upload path has been changed to use streaming and zero-copy buffers, which eliminates the need to load an entire instant into memory at any point.

The approach has been tested on a 19GB test set in both AWS and GCP environments.
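
For readers unfamiliar with the idea, here is a minimal sketch of a streaming upload to a presigned URL. It is not the code in this PR: the class and method names (`StreamingUploadSketch`, `copy`, `upload`) and the 64 KiB buffer size are hypothetical, it uses the JDK's `HttpURLConnection` rather than whatever HTTP client the extractor uses, and it does not show the zero-copy buffer handling. It only illustrates that memory use stays bounded by a fixed buffer, regardless of the size of the instant file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical illustration only; names and buffer size are assumptions.
final class StreamingUploadSketch {
    private static final int BUFFER_SIZE = 64 * 1024;

    // Copies the source to the sink through one fixed-size buffer,
    // so at most BUFFER_SIZE bytes of the payload are held at a time.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Streams the source to a presigned URL with an HTTP PUT and returns
    // the response status code. Declaring the content length up front stops
    // HttpURLConnection from buffering the whole request body internally.
    public static int upload(InputStream source, long contentLength, URL presignedUrl)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) presignedUrl.openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setFixedLengthStreamingMode(contentLength);
            try (OutputStream out = conn.getOutputStream()) {
                copy(source, out);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

In this sketch the caller is assumed to know the object size in advance (for example from the storage listing), since the fixed-length mode needs it before the first byte is written.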

nimahajan commented 4 weeks ago

Task linked: ENG-12847 Optimise metadata extractor jvm params to prevent heap space issues

sonarcloud[bot] commented 3 weeks ago

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
87.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud