google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Fix for the high memory usage #935

Closed: chandrashekar-s closed this 5 months ago

chandrashekar-s commented 5 months ago

Description of what I changed

Fixes #900

The incremental run used to fail when a high number of FHIR resources were already present in the system. This was because, during the incremental run, the delta records had to be merged with the existing records in the Parquet files, and this required reading all of the Parquet records.

It turned out that the minimum amount of memory needed to read from all the Parquet shards in parallel is given by the formula below:

memoryNeeded = Misc memory for JVM stack + (#Parallel Pipeline Threads * #Parallel Pipelines * Parquet Row Group Size)

This was causing failures with the default Parquet row group size of 128 MB, which required a large amount of memory when all the pipeline threads loaded these row groups into memory at once.
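For example (illustrative numbers, not the project's actual defaults): with 2 parallel pipelines, each using 8 threads, the default 128 MB row group size implies roughly 8 * 2 * 128 MB = 2 GB of heap for the merge step alone, on top of the JVM's baseline usage.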

Changes are made to reduce the Parquet Row Group Size to 32 MB so the pipelines can run even in a low-resource environment, and the size is now configurable as well. The application also fails fast if the configured parameters are not sufficient.
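As a rough illustration of the fail-fast idea, here is a minimal sketch; the overhead constant and the class and method names are assumptions for illustration, not the code in this PR:

```java
// Minimal sketch of a fail-fast heap check at startup; constants and names
// here are assumptions for illustration, not the PR's actual code.
public class MemoryCheck {
  private static final long JVM_OVERHEAD_BYTES = 512L * 1024 * 1024; // assumed misc JVM memory

  public static void checkHeapOrFail(int pipelineThreads, int parallelPipelines, long rowGroupBytes) {
    long needed = JVM_OVERHEAD_BYTES + (long) pipelineThreads * parallelPipelines * rowGroupBytes;
    long maxHeap = Runtime.getRuntime().maxMemory(); // roughly the configured -Xmx
    if (maxHeap < needed) {
      throw new IllegalStateException(
          String.format(
              "Configured heap (%d MB) is below the estimated requirement (%d MB) for"
                  + " %d threads x %d pipelines x %d MB row groups; increase -Xmx or"
                  + " lower the row group size.",
              maxHeap >> 20, needed >> 20, pipelineThreads, parallelPipelines, rowGroupBytes >> 20));
    }
  }

  public static void main(String[] args) {
    // Example: 8 threads, 2 pipelines, 32 MB row groups.
    checkHeapOrFail(8, 2, 32L * 1024 * 1024);
  }
}
```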

E2E test

Ran the pipelines' incremental runs and verified the changes described above.


Checklist: I completed these to help reviewers :)

chandrashekar-s commented 5 months ago

Hi @bashir2, the changes are complete. However, the application is failing to launch in the e2e test cases because the machine used in the e2e runs is of type n1-highcpu-32, which I assume is a 32-core machine, while the default values currently only support up to an 8-core machine.

I just wanted to get your opinion: should the application's default configuration be increased to support the 32-core machine, which means the Xmx value would have to default to more than 6GB, or should we just add a WARN log instead of failing the application at launch?

The other option I am thinking of is to automatically derive the JVM memory parameters from the thread count and Parquet row group settings, which might be a little non-trivial.
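As a rough sketch of that idea (the class name, helper name, and overhead constant are assumptions; a real implementation would live in a launcher or wrapper script, since a running JVM cannot raise its own -Xmx):

```java
// Hypothetical sketch: derive a recommended -Xmx from the pipeline settings.
// A wrapper script would apply the printed flag before starting the JVM.
public class XmxAdvisor {
  static String recommendedXmx(int threads, int pipelines, long rowGroupBytes) {
    long overheadBytes = 512L * 1024 * 1024; // assumed misc JVM overhead
    long heapBytes = overheadBytes + (long) threads * pipelines * rowGroupBytes;
    return "-Xmx" + (heapBytes >> 20) + "m";
  }

  public static void main(String[] args) {
    // 32 threads, 2 pipelines, 32 MB row groups -> prints "-Xmx2560m".
    System.out.println(recommendedXmx(32, 2, 32L * 1024 * 1024));
  }
}
```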

chandrashekar-s commented 5 months ago

In the latest commit, the default JVM memory values have been increased to support machines with up to 32 cores. Also, a point to note is that the default Parquet Row Group Size has been reduced from 128 MB to 32 MB. This is to support running the pipelines without failures even in low-resource environments. The downsides of reducing it have been documented in the PR.
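For reference, this is roughly where the row group size takes effect when writing Parquet with the stock parquet-avro API (an illustrative sketch with a toy schema and path; the pipeline's actual writer setup may differ):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriterExample {
  public static void main(String[] args) throws Exception {
    // Toy schema for illustration only.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Patient\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"string\"}]}");
    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/patients.parquet"))
            .withSchema(schema)
            // 32 MB row groups instead of the library's 128 MB default; each reader
            // thread buffers roughly one row group, which drives the memory formula above.
            .withRowGroupSize(32 * 1024 * 1024)
            .build()) {
      // writer.write(record) calls would go here.
    }
  }
}
```

Smaller row groups reduce per-reader memory at the cost of more row group metadata and somewhat less efficient compression and scans, which is the trade-off documented in the PR.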