Closed andre-merzky closed 1 year ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 43.73%. Comparing base (cb71321) to head (b3911e1). Report is 3530 commits behind head on devel.
Should we close this, since it is contained in #2819? Note that this branch is now behind #2819, so be careful! :)
Right - closed.
This PR reliably terminates a raptor MPI worker if any worker rank fails to send heartbeats for some time. It also fixes a problem with `popen` executor task cancellation, which now kills the executed process group (instead of only the execution script). Finally, it changes access to the master's `self._workers` data structures to atomic accesses, removing the need to lock that structure, which mitigates some of the performance penalty of introducing heartbeat recording.
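
For readers unfamiliar with the process-group part of this change, here is a minimal sketch of the general technique using only the Python standard library on a POSIX system. It is not the code from this PR: the helper names `launch` and `cancel` and the grace period are hypothetical, and the actual executor may set up and signal the group differently.

```python
import os
import signal
import subprocess


def launch(cmd):
    """Start `cmd` via the shell in its own session (hypothetical helper)."""
    # start_new_session=True calls setsid() in the child, so the child leads
    # a new process group whose id equals its pid
    return subprocess.Popen(cmd, shell=True, start_new_session=True)


def cancel(proc, grace=5.0):
    """Signal the whole process group of `proc`, not just the script itself."""
    pgid = proc.pid  # valid because the child was started as a group leader
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # no process is left in the group
        return proc.wait()
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # the group leader ignored SIGTERM: escalate to SIGKILL
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # the script starts a background child; both share one process group
    p = launch("sleep 60 & sleep 60")
    print(cancel(p))
```

Signalling only `proc.pid` with `proc.terminate()` would stop the shell but could leave the background `sleep` running, which is the kind of leftover process that group-level cancellation avoids.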