intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Add signal timeout for 'stop_workers' #1213

Closed BalaBalaYi closed 2 months ago

BalaBalaYi commented 2 months ago

What changes were proposed in this pull request?

Raises an error (triggering pod failover) if 'stop_workers' times out.
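As a rough illustration of the idea (not the code from this PR), a signal-based timeout around a blocking call might be sketched as follows. The names `StopWorkersTimeoutError`, `run_with_signal_timeout`, and `fake_stop_workers` are hypothetical and exist only for this example; the sketch assumes a Unix platform and that the call runs in the main thread, since Python installs signal handlers only there.

```python
import signal
import time


class StopWorkersTimeoutError(RuntimeError):
    """Hypothetical error raised when the wrapped call exceeds its timeout."""


def run_with_signal_timeout(func, timeout_secs):
    """Run func() and raise StopWorkersTimeoutError if it runs too long.

    Uses SIGALRM via setitimer, so it works only on Unix and only when
    called from the main thread.
    """

    def _on_alarm(signum, frame):
        # Raising here interrupts whatever func() is currently blocked on.
        raise StopWorkersTimeoutError(
            f"call did not finish within {timeout_secs}s"
        )

    previous_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_secs)  # arm the timer
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, previous_handler)  # restore handler


def fake_stop_workers(hang_secs):
    """Hypothetical stand-in for a stop routine that may hang."""
    time.sleep(hang_secs)
    return "stopped"
```

A caller could wrap the stop routine, for example `run_with_signal_timeout(lambda: fake_stop_workers(60), 30)`, and let the resulting exception propagate so that the surrounding process fails and can be replaced.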

Why are the changes needed?

'stop_workers' may hang if there is a hardware issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.19%. Comparing base (0ef27aa) to head (988fd06).

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/elastic_agent/torch/training.py | 87.50% | 2 Missing :warning: |
| ...lrover/python/tests/test_elastic_training_agent.py | 92.85% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1213      +/-   ##
==========================================
+ Coverage   80.16%   80.19%   +0.03%
==========================================
  Files         217      217
  Lines       19195    19220      +25
==========================================
+ Hits        15387    15413      +26
+ Misses       3808     3807       -1
```


majieyue commented 2 months ago

LGTM