Closed: AymenFJA closed this 1 year ago
Merging #2985 (c545c85) into devel (44fde6b) will not change coverage. The diff coverage is 0.00%.

Coverage Diff | devel | #2985 | +/- |
---|---|---|---|
Coverage | 41.62% | 41.62% | |
Files | 95 | 95 | |
Lines | 10506 | 10506 | |
Hits | 4373 | 4373 | |
Misses | 6133 | 6133 | |

Impacted Files | Coverage Δ | |
---|---|---|
src/radical/pilot/raptor/master.py | 28.29% <0.00%> (ø) |
In general, I do not like timeouts, as they should be replaced with a suitable coordination algorithm. The main issue is: how do we know what a reasonable time is, other than by trial and error? Anyway, for this specific issue, this should work.
Sure - but the 'proper' solution needs a bit more work. At the moment, only rank 0 registers with the master, which then listens for heartbeats from all ranks of that worker. The correct solution would be for all ranks to register individually. That is out of scope for this issue though, as you said.
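To make the per-rank idea concrete, here is a minimal sketch of what individual registration could look like. It is not the code in `master.py`: the class `HeartbeatMonitor`, its methods, and the worker ID `worker.0000` are all hypothetical names chosen for illustration, and the clock is injectable only so the example runs without sleeping.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: track the last heartbeat of each (worker, rank)."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout      # seconds a rank may stay silent
        self._clock = clock          # injectable so tests can fake time
        self._last_seen = {}         # (worker_id, rank) -> last timestamp

    def register(self, worker_id, rank):
        # Every rank registers itself, not only rank 0.
        self._last_seen[(worker_id, rank)] = self._clock()

    def heartbeat(self, worker_id, rank):
        key = (worker_id, rank)
        if key not in self._last_seen:
            raise KeyError(f'rank {rank} of {worker_id} never registered')
        self._last_seen[key] = self._clock()

    def stale(self):
        # Return all ranks that have been silent for longer than the timeout.
        now = self._clock()
        return sorted(key for key, seen in self._last_seen.items()
                      if now - seen > self._timeout)


# Usage with a fake clock: rank 1 registers but never sends a heartbeat.
now = [0.0]
monitor = HeartbeatMonitor(timeout=10.0, clock=lambda: now[0])
monitor.register('worker.0000', 0)
monitor.register('worker.0000', 1)

now[0] = 8.0
monitor.heartbeat('worker.0000', 0)

now[0] = 15.0
print(monitor.stale())   # [('worker.0000', 1)]
```

With this layout the master only watches ranks it has actually heard from, so a rank that joins late is simply not monitored until it registers. The timeout then bounds silence after registration rather than the time a rank takes to join.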
This PR should fix the issue of worker ranks dying if they do not join within a specific time period. To do so, it increases the time the master allows before it checks on the worker.
Related: https://github.com/radical-cybertools/other_activities/issues/59