intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Skip memory limitation for gpu type node relaunch operation. #1341

Closed BalaBalaYi closed 55 minutes ago

BalaBalaYi commented 5 hours ago

What changes were proposed in this pull request?

  1. Skip the memory-limit check during the relaunch operation if the node has GPU resources.
  2. Add a default cap on memory adjustment for the OOM case under torch training.
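The two changes above can be sketched as follows. This is a minimal illustrative sketch, not DLRover's actual implementation: the names `Node`, `should_skip_memory_check`, `adjusted_memory_after_oom`, and `DEFAULT_OOM_MEMORY_CAP_MB` are all hypothetical, as is the doubling factor.

```python
from dataclasses import dataclass

# Hypothetical default cap (MB) on how far the memory request may be
# raised after an OOM; the real value and unit are assumptions.
DEFAULT_OOM_MEMORY_CAP_MB = 65536


@dataclass
class Node:
    gpu_num: int    # number of GPUs requested by the node
    memory_mb: int  # current memory request in MB


def should_skip_memory_check(node: Node) -> bool:
    """GPU nodes bypass the memory-limit check on relaunch, since
    GPU-type training jobs are not resource-scalable."""
    return node.gpu_num > 0


def adjusted_memory_after_oom(node: Node, factor: float = 2.0) -> int:
    """Raise the memory request after an OOM, but never beyond the
    default cap."""
    return min(int(node.memory_mb * factor), DEFAULT_OOM_MEMORY_CAP_MB)
```

With this shape, a CPU-only node that OOMs gets its memory request grown up to the cap, while a GPU node is relaunched without re-running the memory-limit judgement at all.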

Why are the changes needed?

There is a memory limit that prevents invalid resource scaling (exceeding quota) in the TensorFlow training case. For now, this limit should not apply to torch training, because current GPU-type training jobs are not resource-scalable.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

codecov[bot] commented 5 hours ago

Codecov Report

Attention: Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.17%. Comparing base (ec94ab6) to head (325e9a2). Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dlrover/python/master/node/dist_job_manager.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1341      +/-   ##
==========================================
+ Coverage   81.15%   81.17%   +0.02%
==========================================
  Files         231      231
  Lines       21965    21988      +23
==========================================
+ Hits        17825    17849      +24
+ Misses       4140     4139       -1
```
