intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

fix process leak in ascend npu #1331

Closed majieyue closed 3 days ago

majieyue commented 5 days ago

What changes were proposed in this pull request?

In rare cases, sub-processes of workers can become orphan processes when the workers are being stopped. This PR cleans up those orphan processes after the workers have been stopped.
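To illustrate the kind of cleanup described above, here is a minimal sketch. It is not the code from this PR: the names `ProcInfo`, `list_processes`, `find_orphans`, and `kill_orphans` are hypothetical, and it assumes a Linux host where orphaned sub-processes are reparented to PID 1 and can be recognized by a substring of their command line. It uses only the standard library.

```python
import os
import signal
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProcInfo:
    """Hypothetical record for one process (not a DLRover type)."""
    pid: int
    ppid: int
    cmdline: str


def list_processes() -> List[ProcInfo]:
    """Snapshot all processes by reading /proc (Linux only)."""
    procs = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", "rb") as f:
                stat = f.read().decode(errors="replace")
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ppid ..."; comm may contain
        # spaces or ')', so split on the last ')' before reading the fields.
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        procs.append(ProcInfo(int(entry), ppid, cmdline))
    return procs


def find_orphans(
    procs: Iterable[ProcInfo], marker: str, init_pid: int = 1
) -> List[ProcInfo]:
    """Return processes adopted by init whose command line contains marker.

    Assumption: orphans are reparented to init_pid. With a subreaper in
    the process tree they would be adopted by that process instead.
    """
    return [p for p in procs if p.ppid == init_pid and marker in p.cmdline]


def kill_orphans(
    orphans: Iterable[ProcInfo], sig: int = signal.SIGKILL
) -> List[int]:
    """Send sig to each orphan and return the PIDs that were signalled."""
    killed = []
    for proc in orphans:
        try:
            os.kill(proc.pid, sig)
            killed.append(proc.pid)
        except (ProcessLookupError, PermissionError):
            # Already gone, or not ours to kill: skip it.
            continue
    return killed
```

A caller would run something like `kill_orphans(find_orphans(list_processes(), "train.py"))` once the workers have been stopped, where `"train.py"` stands in for whatever identifies the training sub-processes. Keeping `find_orphans` free of side effects makes the selection rule easy to unit test without spawning real processes.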

Why are the changes needed?

Orphan processes can keep holding resources, which leaves the NPU unavailable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT and BVT

codecov[bot] commented 4 days ago

Codecov Report

Attention: Patch coverage is 78.18182% with 36 lines in your changes missing coverage. Please review.

Project coverage is 81.11%. Comparing base (07b18ac) to head (c68f384). Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dlrover/python/tests/orphan_process.py | 0.00% | 24 Missing :warning: |
| ...lrover/python/tests/test_elastic_training_agent.py | 89.18% | 8 Missing :warning: |
| dlrover/python/elastic_agent/torch/training.py | 87.09% | 4 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1331      +/-   ##
==========================================
+ Coverage   81.03%   81.11%   +0.07%
==========================================
  Files         230      231       +1
  Lines       21788    21949     +161
==========================================
+ Hits        17656    17804     +148
- Misses       4132     4145      +13
```
