chandrashekar-s closed this 5 months ago
Hi @bashir2, the changes are complete. However, the application fails to launch in the e2e test cases because the machine used for the e2e runs is of type n1-highcpu-32,
which I assume is a 32-core machine, while the default values currently only support up to an 8-core machine.
I wanted to get your opinion: should the default application configuration be increased to support the 32-core machine, which means the Xmx value would default to more than 6 GB, or should we just add a WARN log and not fail the application launch?
The other option I am considering is to automatically derive the JVM memory parameters from the thread and Parquet row group settings, which might be a little non-trivial.
In the latest commit, the default JVM memory values have been increased to support machines with up to 32 cores. Also note that the Parquet row group size has been reduced from its default of 128 MB to 32 MB, so that the pipelines run without failure even in low-resource environments. The downsides of reducing it are documented in the PR.
Description of what I changed
Fixes #900
The incremental run used to fail when a high number of FHIR resources to be processed were already present in the system. This was because, during the incremental run, the delta records had to be merged with the existing records in the Parquet files, which required reading all of the Parquet records.
It turned out that the minimum amount of memory needed to read from all the Parquet shards in parallel is given by the formula below:
memoryNeeded = Misc memory for JVM stack + (#Parallel Pipeline Threads * #Parallel Pipelines * Parquet Row Group Size)
This caused failures with the default Parquet row group size of 128 MB, which required a large amount of memory to load these records in-memory across all the pipeline threads.
The changes reduce the Parquet row group size to 32 MB so the pipelines can run even in a low-resource environment, and the value is now configurable as well. The application also fails fast if the configured parameters are not sufficient.
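To make the formula concrete, here is a minimal sketch of the estimate and the fail-fast check. The class and method names are hypothetical (not the actual pipeline code); only the arithmetic and the use of `Runtime.maxMemory()` as a proxy for the configured -Xmx are meant to illustrate the idea:

```java
public class HeapEstimator {

  // Hypothetical helper: estimates the minimum JVM heap, in bytes, needed to
  // merge all Parquet shards in parallel, per the formula above.
  static long estimateMinHeapBytes(
      long miscJvmBytes, int pipelineThreads, int parallelPipelines, long rowGroupBytes) {
    return miscJvmBytes + (long) pipelineThreads * parallelPipelines * rowGroupBytes;
  }

  // Fail-fast sketch: abort startup if the configured max heap cannot cover
  // the estimate, instead of failing later with an OutOfMemoryError.
  static void checkHeapOrFail(long neededBytes) {
    long maxHeap = Runtime.getRuntime().maxMemory(); // roughly the -Xmx value
    if (maxHeap < neededBytes) {
      throw new IllegalStateException(
          "Configured max heap " + maxHeap + " bytes is below the required " + neededBytes);
    }
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    // Example: 1 GB misc JVM overhead, 32 threads, 2 parallel pipelines,
    // 32 MB row groups -> 1024 + (32 * 2 * 32) = 3072 MB.
    long needed = estimateMinHeapBytes(1024 * mb, 32, 2, 32 * mb);
    System.out.println("Minimum heap needed ~ " + (needed / mb) + " MB");
  }
}
```

With the old 128 MB row groups, the same 32-thread example would need 1024 + (32 * 2 * 128) = 9216 MB, which is why the default Xmx became insufficient on large machines.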
E2E test
Ran the pipelines' incremental runs and verified the behavior described above.
TESTED:
Please replace this with a description of how you tested your PR beyond the automated e2e/unit tests.
Checklist: I completed these to help reviewers :)
[x] I have read and will follow the review process.
[x] I am familiar with Google Style Guides for the language I have coded in.
No? Please take some time and review Java and Python style guides.
[x] My IDE is configured to follow the Google code styles.
No? Unsure? -> configure your IDE.
[ ] I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)
[x] I ran
mvn clean package
right before creating this pull request and added all formatting changes to my commit.
[x] All new and existing tests passed.
[x] My pull request is based on the latest changes of the master branch.
No? Unsure? -> execute command
git pull --rebase upstream master