intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Expose ckpt events #1321

Open samplise opened 1 week ago

samplise commented 1 week ago

What changes were proposed in this pull request?

Expose critical checkpoint (ckpt) events.

Why are the changes needed?

Track progress and errors during checkpointing.
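
For readers unfamiliar with the idea, here is a minimal sketch of what emitting begin/end/error events around a checkpoint save could look like. It is not taken from this PR: `CkptEvent`, `ckpt_span`, the event names, and the `emit` callback are all hypothetical, and the actual DLRover implementation may differ.

```python
import contextlib
import time
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class CkptEvent:
    """Hypothetical event record; not DLRover's actual type."""
    name: str
    step: int
    timestamp: float = field(default_factory=time.time)
    error: str = ""


@contextlib.contextmanager
def ckpt_span(step: int, emit: Callable[[CkptEvent], None]) -> Iterator[None]:
    """Emit a begin event, then an end or error event around the wrapped save."""
    emit(CkptEvent("ckpt_save_begin", step))
    try:
        yield
    except Exception as exc:
        # Record the failure, then re-raise so the caller still sees it.
        emit(CkptEvent("ckpt_save_error", step, error=repr(exc)))
        raise
    else:
        emit(CkptEvent("ckpt_save_end", step))


events: List[CkptEvent] = []

# Successful save (the body is a stand-in for the real checkpoint write).
with ckpt_span(100, events.append):
    pass

# Failing save: the error is recorded and the exception still propagates.
try:
    with ckpt_span(200, events.append):
        raise OSError("disk full")
except OSError:
    pass

print([e.name for e in events])
```

In this sketch `emit` is a plain list append; in a real system it would presumably forward the event to whatever collects them.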

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

codecov[bot] commented 5 days ago

Codecov Report

Attention: Patch coverage is 95.34884% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.90%. Comparing base (0cff503) to head (26a5388). Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dlrover/python/elastic_agent/master_client.py | 81.81% | 2 Missing :warning: |
| dlrover/python/elastic_agent/torch/ckpt_saver.py | 87.50% | 1 Missing :warning: |
| dlrover/python/master/servicer.py | 85.71% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1321      +/-   ##
==========================================
+ Coverage   80.84%   80.90%   +0.05%
==========================================
  Files         229      229
  Lines       21574    21677     +103
==========================================
+ Hits        17442    17537      +95
- Misses       4132     4140       +8
```
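
The figures in the coverage diff can be cross-checked with a few lines of arithmetic. The sketch below assumes coverage is simply hits divided by tracked lines. The patch size of 86 lines is inferred from the stated 95.34884% with 4 lines missing; the report does not state it. The helper name `coverage_pct` is illustrative.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines that were hit."""
    return 100.0 * hits / lines


# Numbers copied from the coverage diff.
base = {"lines": 21574, "hits": 17442, "misses": 4132}
head = {"lines": 21677, "hits": 17537, "misses": 4140}

# Hits and misses should add up to the tracked line count.
for report in (base, head):
    assert report["hits"] + report["misses"] == report["lines"]

base_pct = coverage_pct(base["hits"], base["lines"])
head_pct = coverage_pct(head["hits"], head["lines"])
delta = head_pct - base_pct

# Patch coverage: 4 missing lines out of an inferred 86 changed lines.
patch_lines = 86  # assumption: inferred, not stated in the report
patch_pct = coverage_pct(patch_lines - 4, patch_lines)

print(f"base={base_pct:.3f}% head={head_pct:.3f}% delta={delta:+.3f}%")
print(f"patch={patch_pct:.5f}%")
```

The computed values agree with the report to within its displayed precision; the exact rounding Codecov applies is not specified here.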
