intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Fix path creation in fsdp dcp saver. #1251

Closed BalaBalaYi closed 2 months ago

BalaBalaYi commented 2 months ago

What changes were proposed in this pull request?

Make workers on other ranks wait until the path has been created.

Why are the changes needed?

Workers on other ranks need to wait until the path has been created.
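For illustration only, a minimal sketch of this kind of wait, assuming one rank (rank 0 here) creates the directory and the remaining ranks poll a shared filesystem until it appears. The names `wait_for_path` and `prepare_checkpoint_dir`, the rank-0 convention, and the timeout values are hypothetical and are not taken from the DLRover code changed in this PR.

```python
import os
import time


def wait_for_path(path: str, timeout: float = 30.0, interval: float = 0.1) -> bool:
    """Poll until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def prepare_checkpoint_dir(path: str, rank: int, timeout: float = 30.0) -> None:
    """Hypothetical helper: rank 0 creates the directory, other ranks wait for it."""
    if rank == 0:
        # exist_ok avoids an error if the directory is already there.
        os.makedirs(path, exist_ok=True)
    elif not wait_for_path(path, timeout=timeout):
        raise TimeoutError(
            f"rank {rank}: {path} was not created within {timeout}s"
        )
```

Polling with a deadline keeps the non-creating ranks from writing into a directory that does not exist yet, and the timeout turns a missing directory into an explicit error instead of an indefinite hang.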

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests. A training test will follow later.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 33.33333% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.41%. Comparing base (211903e) to head (854bda1). Report is 1 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/elastic_agent/torch/ckpt_saver.py | 0.00% | 2 Missing :warning: |
| dlrover/python/master/resource/job.py | 50.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1251      +/-   ##
==========================================
- Coverage   80.41%   80.41%   -0.01%
==========================================
  Files         218      218
  Lines       19800    19802       +2
==========================================
  Hits        15923    15923
- Misses       3877     3879       +2
```
