intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Support action timeout processing. #1212

Closed BalaBalaYi closed 3 months ago

BalaBalaYi commented 3 months ago

What changes were proposed in this pull request?

  1. Add logic to torch-training-monitor to determine whether an action has timed out.
  2. Record the time before and after the 'stop' action to measure its cost (duration).
  3. Raise an error (triggering pod failover) if an action times out.
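
The three changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `ActionTimer`, `ActionTimeoutError`, the method names, and the 5-second limit are all hypothetical, and the real monitor in `dlrover/python/elastic_agent/monitor/training.py` may be structured quite differently.

```python
import time


class ActionTimeoutError(RuntimeError):
    """Hypothetical error type; the caller would treat it as a failover trigger."""


class ActionTimer:
    """Records how long named actions take and reports timeouts.

    All names here are illustrative assumptions, not DLRover APIs.
    """

    def __init__(self, timeout_secs, clock=time.monotonic):
        self._timeout_secs = timeout_secs
        self._clock = clock      # injectable so tests can fake time
        self._started = {}       # action name -> start timestamp
        self._cost = {}          # action name -> elapsed seconds once finished

    def start(self, action):
        """Mark the beginning of an action such as 'stop'."""
        self._started[action] = self._clock()

    def finish(self, action):
        """Mark the end of an action and return its cost in seconds."""
        elapsed = self._clock() - self._started.pop(action)
        self._cost[action] = elapsed
        return elapsed

    def is_action_timeout(self, action):
        """True if the action is running, or has finished, beyond the limit."""
        if action in self._started:
            return self._clock() - self._started[action] > self._timeout_secs
        return self._cost.get(action, 0.0) > self._timeout_secs

    def check(self, action):
        """Raise ActionTimeoutError if the action exceeded the limit."""
        if self.is_action_timeout(action):
            raise ActionTimeoutError(
                f"action '{action}' exceeded {self._timeout_secs}s"
            )


if __name__ == "__main__":
    timer = ActionTimer(timeout_secs=5.0)
    timer.start("stop")
    # ... the real 'stop' action would run here ...
    timer.finish("stop")
    timer.check("stop")  # does not raise: the action finished well within 5s
```

Using a monotonic clock rather than wall-clock time keeps the measured cost unaffected by system clock adjustments, and passing the clock in as a parameter makes the timeout path easy to cover in unit tests.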

Why are the changes needed?

Explain the purpose or motivation behind these changes. What problem are you trying to solve?

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 84.00% with 12 lines in your changes missing coverage. Please review.

Project coverage is 80.14%. Comparing base (26797d7) to head (6447d54). Report is 12 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/elastic_agent/torch/training.py | 50.00% | 7 Missing :warning: |
| dlrover/python/elastic_agent/monitor/training.py | 85.18% | 4 Missing :warning: |
| dlrover/python/tests/test_agent_monitor.py | 97.05% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1212      +/-   ##
==========================================
+ Coverage   80.05%   80.14%   +0.08%
==========================================
  Files         217      217
  Lines       19149    19234      +85
==========================================
+ Hits        15330    15415      +85
  Misses       3819     3819
```
