openaire / iis

Information Inference Service of the OpenAIRE system
Apache License 2.0

Citation matching failed due to an executor exceeding memory limits #1427

Closed marekhorst closed 1 year ago

marekhorst commented 1 year ago

Originally requested in: https://support.openaire.eu/issues/8966

The direct citation matching phase failed on BETA with:

Job aborted due to stage failure: Task 51 in stage 5.0 failed 4 times, most recent failure: Lost task 51.3 in stage 5.0 (TID 37918, eos-m2-sn05.ocean.icm.edu.pl, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

The already deployed IIS workflow was hotpatched, but we should provide a long-term solution by preparing the direct_citationmatching and primary_processing workflows so that the executor memory overhead can be adjusted at runtime.

Once we determine the new memory-related configuration, it should be committed to the GitLab repo at ICM where we keep the config-default.xml file template.
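
For illustration only, an entry in the config-default.xml template could look roughly like the sketch below. It assumes the parameter is named sparkExecutorOverhead (the name already used for the fuzzy citation matching phase); the value 2048 is a placeholder, not a tested setting. Spark reads spark.yarn.executor.memoryOverhead in megabytes when no unit suffix is given.

```xml
<configuration>
    <!-- Assumption: parameter name reused from the fuzzy citation matching phase. -->
    <property>
        <name>sparkExecutorOverhead</name>
        <!-- Placeholder value in megabytes; the real value still has to be determined. -->
        <value>2048</value>
    </property>
</configuration>
```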

marekhorst commented 1 year ago

After hotpatching direct citation matching, one of the subsequent fuzzy citation matching steps (transformation) failed with a similar error:

Container killed by YARN for exceeding memory limits. 11.3 GB of 11 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

This means that more than one module needs its memory configuration adjusted.

The simplest way is to reuse the already defined sparkExecutorOverhead parameter (declared for the fuzzy citation matching phase) and apply it to the input transformer phase by setting:

                --conf spark.yarn.executor.memoryOverhead=${sparkExecutorOverhead}

among the spark-opts of the citation-matching-input-transformer job.
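
A minimal sketch of how the option might sit in the Oozie Spark action follows. It is not copied from the actual workflow: the action layout, the schema version, the transitions, and the sparkExecutorMemory parameter are assumptions made for illustration. Only sparkExecutorOverhead and the job name come from this issue.

```xml
<action name="citation-matching-input-transformer">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <!-- job-tracker, name-node, master, name, class and jar elements omitted -->
        <spark-opts>
            <!-- sparkExecutorMemory is a hypothetical parameter name -->
            --executor-memory ${sparkExecutorMemory}
            --conf spark.yarn.executor.memoryOverhead=${sparkExecutorOverhead}
        </spark-opts>
    </spark>
    <!-- Transition targets are placeholders -->
    <ok to="end"/>
    <error to="fail"/>
</action>
```

With the value resolved from a workflow parameter, the overhead can be changed per run without editing the workflow definition itself.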