google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Deploying Spark fails with insufficient max JVM memory #981

Open jakubadamek opened 4 months ago

jakubadamek commented 4 months ago

I was testing a simple developer setup for pipelines:

cd ~/gits
git clone https://github.com/google/fhir-data-pipes.git 
cd fhir-data-pipes/
chmod a+rx docker/dwh
cd docker/
docker network create cloudbuild
docker-compose  -f hapi-compose.yml up  --force-recreate -d
curl -H "Content-Type: application/json; charset=utf-8"   'http://localhost:8091/fhir/Patient' -v
cd ..
python3 -m venv venv
venv/bin/pip install -r ./synthea-hiv/uploader/requirements.txt
venv/bin/python3 synthea-hiv/uploader/main.py HAPI   http://localhost:8091/fhir --input_dir   ./synthea-hiv/sample_data_small/

This finished with some errors which I ignored: https://paste.googleplex.com/5983969268465664

docker-compose -f docker/compose-controller-spark-sql-single.yaml up --force-recreate

Finished with error due to JVM max memory: https://paste.googleplex.com/4827520718864384

Fixed by editing docker/.env to set JAVA_OPTS=-Xms10g -Xmx10g

Should we make this change permanent?
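For reference, the workaround above amounts to an override like the following in docker/.env (the 10g values are just what happened to work here, not a recommended default):

```properties
# docker/.env -- raise the controller JVM heap (values from the workaround above)
JAVA_OPTS=-Xms10g -Xmx10g
```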

bashir2 commented 4 months ago

Yes, this is a known issue, but we prefer not to increase the default memory significantly. For the reasoning, please take a look at where JAVA_OPTS is set here and the comments above it. So I am guessing that you are running this on a machine with many cores (how many?). We cannot change the JVM memory configuration in the controller code, but we can check in advance and fail early if we know that a large number of threads is going to fail the pipeline later on.

Another option is to limit the number of threads dynamically, i.e., even if numThreads is set to use all cores (e.g., here), we can override it and set it to a smaller number for which the provided memory is enough. That way the pipeline won't fail but may suffer performance-wise; @chandrashekar-s WDYT about this capping change?
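The fail-early idea could be sketched roughly as below. Note this is an illustrative sketch, not the project's code: the class name, method, and the ~512 MiB per-thread budget are all assumptions (the real per-thread figure would need profiling).

```java
// Hypothetical sketch: fail fast when the JVM heap cannot support one worker
// thread per core. BYTES_PER_THREAD is an illustrative assumption, not a
// value taken from fhir-data-pipes.
public class HeapPreflightCheck {

  // Assumed rough per-thread memory budget; the real figure would need profiling.
  static final long BYTES_PER_THREAD = 512L * 1024 * 1024; // 512 MiB

  /** Throws before the pipeline starts if the heap cannot cover numThreads. */
  static void checkOrFail(int numThreads, long maxHeapBytes) {
    long required = (long) numThreads * BYTES_PER_THREAD;
    if (required > maxHeapBytes) {
      throw new IllegalStateException(
          String.format(
              "numThreads=%d needs ~%d MiB but the heap is only %d MiB; "
                  + "raise -Xmx in JAVA_OPTS or lower numThreads",
              numThreads, required >> 20, maxHeapBytes >> 20));
    }
  }

  public static void main(String[] args) {
    // 4 threads on a 4 GiB heap passes; 32 threads on the same heap fails early.
    checkOrFail(4, 4L * 1024 * 1024 * 1024);
    try {
      checkOrFail(32, 4L * 1024 * 1024 * 1024);
    } catch (IllegalStateException expected) {
      System.out.println("preflight rejected oversubscribed config");
    }
  }
}
```

The point is that the user sees a clear error at startup instead of an OutOfMemoryError mid-pipeline.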

chandrashekar-s commented 4 months ago

Dynamically lowering the number of cores is a good idea when the configured memory is insufficient and numThreads is set to use all cores (or some other high value). That way we don't fail the application, but we should also warn the user about the capping in the logs.
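The cap-and-warn behavior discussed above could look something like this sketch. Again, the class, method, and the ~512 MiB per-thread budget are hypothetical placeholders, not the project's actual code or numbers:

```java
import java.util.logging.Logger;

// Hypothetical sketch of the capping idea: if numThreads is sized to all
// cores but the heap is too small, shrink it and log a warning instead of
// letting the pipeline fail later.
public class ThreadCapper {
  private static final Logger log = Logger.getLogger(ThreadCapper.class.getName());

  // Assumed per-thread memory budget; illustrative only.
  static final long BYTES_PER_THREAD = 512L * 1024 * 1024; // 512 MiB

  /** Returns requestedThreads, lowered if the heap cannot support it; never below 1. */
  static int capThreads(int requestedThreads, long maxHeapBytes) {
    int affordable = (int) Math.max(1, maxHeapBytes / BYTES_PER_THREAD);
    if (affordable < requestedThreads) {
      log.warning(
          String.format(
              "Capping numThreads from %d to %d: heap of %d MiB is too small; "
                  + "raise -Xmx in JAVA_OPTS to use more threads",
              requestedThreads, affordable, maxHeapBytes >> 20));
      return affordable;
    }
    return requestedThreads;
  }

  public static void main(String[] args) {
    // 16 requested threads but only a 2 GiB heap -> capped to 4 (at 512 MiB each).
    System.out.println(capThreads(16, 2L * 1024 * 1024 * 1024)); // prints 4
    System.out.println(capThreads(2, 2L * 1024 * 1024 * 1024));  // prints 2
  }
}
```

In the real controller, maxHeapBytes would come from Runtime.getRuntime().maxMemory() rather than being passed in.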