radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

increase raptor's heartbeat time #2985

Closed AymenFJA closed 1 year ago

AymenFJA commented 1 year ago

This PR should fix the issue of worker ranks dying if they do not join within a specific time period. To that end, it increases the heartbeat time the master allows before it checks on the worker.

Related: https://github.com/radical-cybertools/other_activities/issues/59

codecov[bot] commented 1 year ago

Codecov Report

Merging #2985 (c545c85) into devel (44fde6b) will not change coverage. The diff coverage is 0.00%.

@@           Coverage Diff           @@
##            devel    #2985   +/-   ##
=======================================
  Coverage   41.62%   41.62%           
=======================================
  Files          95       95           
  Lines       10506    10506           
=======================================
  Hits         4373     4373           
  Misses       6133     6133           
Impacted Files                         Coverage Δ
src/radical/pilot/raptor/master.py     28.29% <0.00%> (ø)


andre-merzky commented 1 year ago

In general, I do not like timeouts, as they should be replaced with a suitable coordination algorithm. The main issue is: how do we know what a reasonable time is, other than by trial and error? Anyway, for the specific issue, this should work.

Sure, but the 'proper' solution needs a bit more work. At the moment, only rank0 registers with the master, which then listens for heartbeats from all ranks of that worker. The correct solution would be for all ranks to register individually. That is out of scope for this issue though, as you said.
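To make the per-rank registration idea concrete, here is a minimal sketch of what master-side bookkeeping could look like. It is not the radical.pilot implementation: `HeartbeatMonitor`, its methods, the `worker_id.rank` key format, and the 10-second timeout are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical master-side tracker (illustration only, not radical.pilot API)."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout      # seconds a rank may stay silent
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen: Dict[str, float] = {}

    def register(self, worker_id: str, rank: int) -> None:
        # Every rank registers itself, so the master knows exactly which
        # ranks it should expect heartbeats from.
        self._last_seen[f"{worker_id}.{rank}"] = self._clock()

    def heartbeat(self, worker_id: str, rank: int) -> None:
        key = f"{worker_id}.{rank}"
        if key not in self._last_seen:
            raise KeyError(f"heartbeat from unregistered rank: {key}")
        self._last_seen[key] = self._clock()

    def expired(self) -> List[str]:
        # Ranks whose last heartbeat is older than the timeout.
        now = self._clock()
        return sorted(key for key, seen in self._last_seen.items()
                      if now - seen > self._timeout)


# Usage with a fake clock, so no real waiting is needed.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout=10.0, clock=lambda: fake_now[0])
monitor.register("worker.0000", rank=0)
monitor.register("worker.0000", rank=1)

fake_now[0] = 8.0
monitor.heartbeat("worker.0000", rank=0)   # only rank 0 reports in

fake_now[0] = 15.0
stale = monitor.expired()                  # rank 1 has been silent for 15 s
```

With individual registration, a rank that never joins is simply absent from the table rather than being counted as dead, which is the case the larger timeout in this PR works around.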