Make sure the zookeeper lock obtained by CachedWebCrawlerJob is always released when the job gets interrupted

openaire / iis

Information Inference Service of the OpenAIRE system

Apache License 2.0


    2024-11-16 22:38:14,911 [shuffle-client-6-1] ERROR org.apache.spark.network.client.TransportResponseHandler - Still have 1 requests outstanding when connection from eos-m2-sn03.ocean.icm.edu.pl/10.19.65.103:7337 is closed
    2024-11-16 22:38:14,912 [dispatcher-event-loop-1] ERROR org.apache.spark.storage.BlockManager - Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
    java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.

One viable solution is to replicate the finally block's logic at the Oozie workflow definition level, as in the metadata extraction workflow, where lock management is handled solely within the Oozie workflow definition because the metadata extraction job is a MapReduce job.

CachedWebCrawlerJob, in contrast, is a Spark job, so the lock handling logic could be encoded within the same Java class. It turns out, however, that this does not guarantee that the finally block gets executed, because an executor may be forcibly killed.

This means introducing the following action:

    <action name="release-lock-and-fail">
        <java>
            <main-class>eu.dnetlib.iis.common.java.ProcessWrapper</main-class>
            <arg>${lock_managing_process}</arg>
            <arg>-Pzk_session_timeout=${zk_session_timeout}</arg>
            <arg>-Pnode_id=${cache_location}</arg>
            <arg>-Pmode=release</arg>
        </java>
        <ok to="fail" />
        <error to="fail" />
    </action>

which should be referenced when the preceding action triggering CachedWebCrawlerJob execution fails.
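
A minimal sketch of that wiring, assuming the crawler is launched from an Oozie spark action: the action name `cached-webcrawler`, the `end` target and the spark-action namespace version are placeholders, not taken from the actual workflow. Only the error transition is the point here; it routes a failed run through the lock release before reaching the `fail` node.

    <!-- Hypothetical action name; the existing spark configuration stays unchanged. -->
    <action name="cached-webcrawler">
        <spark xmlns="uri:oozie:spark-action:0.2">
            <!-- existing CachedWebCrawlerJob configuration goes here -->
        </spark>
        <!-- "end" is a placeholder for whatever node currently follows on success -->
        <ok to="end" />
        <!-- on failure, release the zookeeper lock first, then fail -->
        <error to="release-lock-and-fail" />
    </action>

Because both transitions of release-lock-and-fail point to fail, the workflow still ends in the failed state regardless of whether the release itself succeeds.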


Make sure the zookeeper lock obtained by CachedWebCrawlerJob is always released when the job gets interrupted #1492