marekhorst opened 4 days ago
One viable solution is to replicate the `finally` block execution at the Oozie workflow definition level, similarly to the metadata extraction workflow, where the lock management logic was handled solely within the Oozie workflow definition because of the MapReduce nature of the metadata extraction job. `CachedWebCrawlerJob`, by contrast, is a Spark job, which allowed encoding the lock handling logic within the same Java class; it turns out, however, that this does not guarantee the `finally` block will be executed, due to a possible forced executor kill.
This means introducing the following action:
```xml
<action name="release-lock-and-fail">
    <java>
        <main-class>eu.dnetlib.iis.common.java.ProcessWrapper</main-class>
        <arg>${lock_managing_process}</arg>
        <arg>-Pzk_session_timeout=${zk_session_timeout}</arg>
        <arg>-Pnode_id=${cache_location}</arg>
        <arg>-Pmode=release</arg>
    </java>
    <ok to="fail" />
    <error to="fail" />
</action>
```
which should be referenced as the error transition of the preceding action that triggers the `CachedWebCrawlerJob` execution.
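For illustration, the crawler action's error transition could point at the new node like this (the action name `webcrawler` and the `ok` target are assumptions for the sketch, not taken from the actual workflow definition):

```xml
<action name="webcrawler">
    <!-- CachedWebCrawlerJob spark action body omitted -->
    <ok to="next-step" />
    <!-- on failure, release the ZooKeeper lock before failing the workflow -->
    <error to="release-lock-and-fail" />
</action>
```

This way the lock release no longer depends on the executor JVM surviving long enough to run in-job cleanup code.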
It turned out that when the first attempt of the `CachedWebCrawlerJob` failed due to a shuffle service connectivity issue, the executor the job was running on was killed in a way that prevented the lock release operation defined in the `finally` block from being executed:
https://github.com/openaire/iis/blob/a3c3c5a59103a3d4d238efa223ba6d3bfb4813d3/iis-common/src/main/java/eu/dnetlib/iis/common/cache/DocumentTextCacheStorageUtils.java#L82
This results in the 2nd attempt stalling while waiting to obtain the lock that was never released by the 1st attempt.
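To illustrate why the release has to move to the workflow level: the in-job pattern relies on a `finally` block, which covers exceptions thrown by the job but not a forced JVM kill. A minimal sketch of that pattern, using a plain `ReentrantLock` as a stand-in for the actual ZooKeeper-based cache lock (all names here are hypothetical, not the real IIS API):

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the lock handling inside the Spark job;
// a ReentrantLock stands in for the ZooKeeper-based cache lock.
public class LockReleaseSketch {
    static final ReentrantLock cacheLock = new ReentrantLock();

    /** Acquires the lock, runs the crawl, returns false if the crawl throws. */
    static boolean crawlWithLock(Runnable crawl) {
        cacheLock.lock();
        try {
            crawl.run();
            return true;
        } catch (RuntimeException e) {
            return false;
        } finally {
            // Runs on normal completion and on exceptions, but NOT when the
            // executor JVM is force-killed (e.g. after a shuffle failure),
            // which is exactly the case that left the lock held.
            cacheLock.unlock();
        }
    }
}
```

On an in-JVM exception the `finally` block does release the lock; a SIGKILL of the executor bypasses it entirely, so only an external (workflow-level) release is reliable.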
Originally described in: https://support.openaire.eu/issues/10157#note-2