google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Fix for the high memory usage #935

Closed: chandrashekar-s closed this 5 months ago

chandrashekar-s commented 5 months ago

Description of what I changed

Fixes #900

The incremental run used to fail when a high number of FHIR resources were already present in the system. This was because, during the incremental run, the delta records had to be merged with the existing records in the Parquet files, and this required reading all of the Parquet records.

It turned out that the minimum amount of memory needed to read from all the Parquet shards in parallel is given by the formula below:

memoryNeeded = Misc memory for JVM stack + (#Parallel Pipeline Threads * #Parallel Pipelines * Parquet Row Group Size)

This was causing failures with the default Parquet row group size of 128 MB, which required a large amount of memory when all the pipeline threads loaded these row groups into memory at once.
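For example (illustrative numbers, not the project's actual defaults): with 2 parallel pipelines, each using 8 threads, the default 128 MB row group size implies roughly 8 * 2 * 128 MB = 2 GB of heap for the merge step alone, on top of the JVM's baseline usage.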

Changes are made to reduce the Parquet Row Group Size to 32 MB so the pipelines can run even in a low-resource environment, and the size is now configurable as well. The application also fails fast if the configured parameters are not sufficient.
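As a rough illustration of the fail-fast idea, here is a minimal sketch; the overhead constant and the class and method names are assumptions for illustration, not the code in this PR:

```java
// Minimal sketch of a fail-fast heap check at startup; constants and names
// here are assumptions for illustration, not the PR's actual code.
public class MemoryCheck {
  private static final long JVM_OVERHEAD_BYTES = 512L * 1024 * 1024; // assumed misc JVM memory

  public static void checkHeapOrFail(int pipelineThreads, int parallelPipelines, long rowGroupBytes) {
    long needed = JVM_OVERHEAD_BYTES + (long) pipelineThreads * parallelPipelines * rowGroupBytes;
    long maxHeap = Runtime.getRuntime().maxMemory(); // roughly the configured -Xmx
    if (maxHeap < needed) {
      throw new IllegalStateException(
          String.format(
              "Configured heap (%d MB) is below the estimated requirement (%d MB) for"
                  + " %d threads x %d pipelines x %d MB row groups; increase -Xmx or"
                  + " lower the row group size.",
              maxHeap >> 20, needed >> 20, pipelineThreads, parallelPipelines, rowGroupBytes >> 20));
    }
  }

  public static void main(String[] args) {
    // Example: 8 threads, 2 pipelines, 32 MB row groups.
    checkHeapOrFail(8, 2, 32L * 1024 * 1024);
  }
}
```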

E2E test

Ran the pipelines' incremental runs and verified the changes described above.


Checklist: I completed these to help reviewers :)

chandrashekar-s commented 5 months ago

Hi @bashir2, the changes are complete. However, the application is failing to launch in the e2e test cases because the machine used in the e2e runs is of type n1-highcpu-32, which I assume is a 32-core machine, while the default values currently only support up to an 8-core machine.

I just wanted to get your opinion: should the application's default configuration be increased to support the 32-core machine, which means the Xmx value would have to default to more than 6GB, or should we just add a WARN log instead of failing the application at launch?

The other option I am thinking of is to automatically derive the JVM memory parameters from the thread count and Parquet row group settings, which might be a little non-trivial.
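As a rough sketch of that idea (the class name, helper name, and overhead constant are assumptions; a real implementation would live in a launcher or wrapper script, since a running JVM cannot raise its own -Xmx):

```java
// Hypothetical sketch: derive a recommended -Xmx from the pipeline settings.
// A wrapper script would apply the printed flag before starting the JVM.
public class XmxAdvisor {
  static String recommendedXmx(int threads, int pipelines, long rowGroupBytes) {
    long overheadBytes = 512L * 1024 * 1024; // assumed misc JVM overhead
    long heapBytes = overheadBytes + (long) threads * pipelines * rowGroupBytes;
    return "-Xmx" + (heapBytes >> 20) + "m";
  }

  public static void main(String[] args) {
    // 32 threads, 2 pipelines, 32 MB row groups -> prints "-Xmx2560m".
    System.out.println(recommendedXmx(32, 2, 32L * 1024 * 1024));
  }
}
```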

chandrashekar-s commented 5 months ago

In the latest commit, the default JVM memory values have been increased to support machines with up to 32 cores. Also, a point to note is that the default Parquet Row Group Size has been reduced from 128 MB to 32 MB. This is to support running the pipelines without failures even in low-resource environments. The downsides of reducing it have been documented in the PR.
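For reference, this is roughly where the row group size takes effect when writing Parquet with the stock parquet-avro API (an illustrative sketch with a toy schema and path; the pipeline's actual writer setup may differ):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriterExample {
  public static void main(String[] args) throws Exception {
    // Toy schema for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Patient\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"string\"}]}");
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/patients.parquet"))
            .withSchema(schema)
            // 32 MB row groups instead of the library's 128 MB default; each reader
            // thread buffers roughly one row group, which drives the memory formula above.
            .withRowGroupSize(32 * 1024 * 1024)
            .build()) {
      // writer.write(record) calls would go here.
    }
  }
}
```

Smaller row groups reduce per-reader memory at the cost of more row group metadata and somewhat less efficient compression and scans, which is the trade-off documented in the PR.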