intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Add failure reporting for async ckpt saver. #1196

Closed BalaBalaYi closed 2 months ago

BalaBalaYi commented 2 months ago

What changes were proposed in this pull request?

Use 'master_client' to report checkpoint (ckpt) failures from the background thread of the async ckpt saver.
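The pattern can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StubMasterClient`, its `report_failure` method, `AsyncSaver`, and `flaky_save` are all hypothetical names and do not reflect DLRover's actual `master_client` API. The sketch only shows the idea of catching an exception inside the saving thread and forwarding it to a client instead of letting it end the thread unnoticed.

```python
import queue
import threading


class StubMasterClient:
    """Hypothetical stand-in for a master client; not DLRover's real API."""

    def __init__(self):
        self.failures = []

    def report_failure(self, message):
        # A real client would presumably send this to the job master.
        self.failures.append(message)


class AsyncSaver:
    """Runs save_fn(step) on a background thread and reports any exception."""

    def __init__(self, client, save_fn):
        self._client = client
        self._save_fn = save_fn
        self._steps = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def submit(self, step):
        self._steps.put(step)

    def close(self):
        self._steps.put(None)  # sentinel: tells the loop to stop
        self._thread.join()

    def _loop(self):
        while True:
            step = self._steps.get()
            if step is None:
                return
            try:
                self._save_fn(step)
            except Exception as exc:
                # Without this, the exception would end the thread
                # and never reach the master.
                self._client.report_failure(f"step {step}: {exc!r}")


def flaky_save(step):
    # Hypothetical save function that fails on one step.
    if step == 2:
        raise OSError("disk full")


client = StubMasterClient()
saver = AsyncSaver(client, flaky_save)
for step in (1, 2, 3):
    saver.submit(step)
saver.close()
print(client.failures)
```

Because the exception is caught per step, the thread keeps processing later save requests after a failure; whether the real saver should continue or stop is a design choice this sketch does not settle.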

Why are the changes needed?

To catch and expose problems that may occur while saving a ckpt in the async thread.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests (UT).

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Project coverage is 79.99%. Comparing base (8862fa5) to head (15c48d3). Report is 3 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/elastic_agent/torch/ckpt_saver.py | 83.33% | 6 Missing :warning: |
| ...lrover/python/tests/test_elastic_training_agent.py | 0.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1196      +/-   ##
==========================================
+ Coverage   79.97%   79.99%   +0.01%
==========================================
  Files         215      216       +1
  Lines       19040    19094      +54
==========================================
+ Hits        15227    15274      +47
- Misses       3813     3820       +7
```
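As a sanity check, the reported percentages can be reproduced from the hit and line counts in the coverage diff. The sketch below assumes percentages are truncated (rounded down) to two decimals, which is consistent with the reported +0.01% delta. The patch size of 64 lines is not stated in the report; it is inferred from 7 missing lines at 89.0625% coverage.

```python
import math


def coverage_pct(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


def truncate(value, places=2):
    """Round down to the given number of decimals (assumed reporting rule)."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


base = coverage_pct(15227, 19040)  # master (8862fa5)
head = coverage_pct(15274, 19094)  # head (15c48d3)

# Inferred, not stated in the report: 7 missing lines out of 64 changed.
patch_lines = 64
patch_missing = 7
patch = coverage_pct(patch_lines - patch_missing, patch_lines)

print(truncate(base), truncate(head), truncate(head - base), patch)
```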


Eugene1518 commented 2 months ago

I have received your email. Thanks~