intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Resolve pending and insufficient nodes issue. #1210

Closed BalaBalaYi closed 3 months ago

BalaBalaYi commented 3 months ago

What changes were proposed in this pull request?

  1. Add a check for pending nodes.
  2. Add a check for insufficient nodes.
  3. Optimize the 'should_early_stop' function.
  4. Update how annotations are used.

Why are the changes needed?

A job should stop early in the following two cases:

  1. Pending nodes exist, so training cannot continue.
  2. There are not enough nodes, so training cannot continue.

Details of these 2 scenarios can be found in the comments in the code.
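
For readers without the diff open, the two cases can be sketched roughly as below. This is a minimal illustration and not the code in this PR: the `Node` record, the `early_stop_reason` function, and the `min_nodes` and `pending_timeout` parameters are all hypothetical names, and treating "pending too long" as a timeout is an assumption about how the check might work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical node record for illustration; not DLRover's Node class.
    name: str
    status: str  # one of "pending", "running", "failed"
    pending_seconds: float = 0.0


def early_stop_reason(
    nodes: List[Node],
    min_nodes: int,
    pending_timeout: float,
) -> Optional[str]:
    """Return a reason to stop early, or None if training can continue."""
    # Case 1 (assumed rule): a node has stayed pending past the timeout.
    stuck = [
        n for n in nodes
        if n.status == "pending" and n.pending_seconds > pending_timeout
    ]
    if stuck:
        return "pending nodes"

    # Case 2 (assumed rule): even counting nodes that may still start,
    # fewer than the required minimum remain.
    usable = [n for n in nodes if n.status in ("running", "pending")]
    if len(usable) < min_nodes:
        return "insufficient nodes"

    return None
```

For example, with `min_nodes=2` and `pending_timeout=600`, one running worker plus one worker that has been pending for 1000 seconds yields `"pending nodes"`, while one running worker plus one failed worker yields `"insufficient nodes"`.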

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests only for now. More tests with real training jobs are needed later.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 98.11321% with 5 lines in your changes missing coverage. Please review.

Project coverage is 80.36%. Comparing base (592e1c4) to head (d65b721). Report is 25 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/master/node/worker.py | 96.61% | 2 Missing :warning: |
| dlrover/python/master/node/dist_job_manager.py | 93.75% | 1 Missing :warning: |
| dlrover/python/master/node/job_manager.py | 97.50% | 1 Missing :warning: |
| dlrover/python/tests/test_worker_manager.py | 98.92% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1210      +/-   ##
==========================================
+ Coverage   79.99%   80.36%   +0.37%
==========================================
  Files         216      217       +1
  Lines       19094    19403     +309
==========================================
+ Hits        15274    15593     +319
+ Misses       3820     3810      -10
```
