radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Feature/raptor resilience #2816

Closed andre-merzky closed 1 year ago

andre-merzky commented 1 year ago

This PR reliably terminates a raptor MPI worker if any worker rank fails to send heartbeats for some time. It also fixes a problem with task cancellation in the popen executor, which now kills the executed process group (instead of only the execution script). Finally, it changes accesses to the master's self._workers data structures to atomic accesses, which removes the need to lock that structure and mitigates some of the performance penalty of introducing heartbeat recording.
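For readers unfamiliar with the pattern, here is a minimal sketch of the two mechanisms described above: per-rank heartbeat tracking and cancellation that signals a whole process group. It is not the code from this PR. `HeartbeatMonitor`, `launch`, and `kill_group` are hypothetical names, the timeout and grace values are arbitrary, and the lock-free update relies on the assumption that a single dict item assignment is atomic in CPython. The process-group part is POSIX-only.

```python
import os
import signal
import subprocess
import time


class HeartbeatMonitor:
    """Track the last heartbeat per rank (hypothetical helper, not RP code)."""

    def __init__(self, ranks, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        now = clock()
        # One entry per rank. The key set never changes afterwards, and each
        # update is a single item assignment, so no explicit lock is used here
        # (assumption: CPython semantics).
        self._last = {rank: now for rank in ranks}

    def beat(self, rank):
        # Record a heartbeat for one rank.
        self._last[rank] = self._clock()

    def stale_ranks(self):
        # Return the ranks whose last heartbeat is older than the timeout.
        now = self._clock()
        return sorted(rank for rank, seen in list(self._last.items())
                      if now - seen > self._timeout)


def launch(cmd):
    # start_new_session=True makes the child the leader of a new session and
    # process group, so the group id equals the child's pid.
    return subprocess.Popen(cmd, shell=True, start_new_session=True)


def kill_group(proc, grace=1.0):
    """Send SIGTERM to the whole group, then SIGKILL if it does not exit."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # the group is already gone
        try:
            return proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            continue  # escalate to the next signal
    return proc.wait()
```

A supervisor would call `beat(rank)` whenever a rank reports in, poll `stale_ranks()` periodically, and call `kill_group(proc)` on the worker's launch process as soon as the list is non-empty. Signalling the group rather than `proc.pid` alone is what takes down processes spawned by the launch script as well.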

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 43.73%. Comparing base (cb71321) to head (b3911e1). Report is 3530 commits behind head on devel.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##            devel    #2816   +/-   ##
=======================================
  Coverage   43.73%   43.73%
=======================================
  Files          83       83
  Lines        9187     9187
=======================================
  Hits         4018     4018
  Misses       5169     5169
```


mturilli commented 1 year ago

Should we close this as it is contained in #2819? Note that this is now behind compared to #2819 so careful! :)

andre-merzky commented 1 year ago

> Should we close this as it is contained in #2819? Note that this is now behind compared to #2819 so careful! :)

Right - closed.