intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Worker get elastic run config from master #1277

Open samplise opened 5 days ago

samplise commented 5 days ago

What changes were proposed in this pull request?

Before training starts, workers read the elastic run configuration from the master.
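A minimal sketch of the idea in Python. The class and method names below (`Master`, `Worker`, `get_elastic_run_config`) are illustrative assumptions, not the actual DLRover API; the point is only the pull-then-merge flow the PR describes.

```python
# Illustrative sketch only: these names are hypothetical, not DLRover's API.

class Master:
    """Holds the cluster-wide elastic run configuration."""

    def __init__(self):
        self._elastic_run_config = {
            "network_check": "true",
            "comm_perf_test": "false",
        }

    def get_elastic_run_config(self):
        # Return a copy so workers cannot mutate the master's state.
        return dict(self._elastic_run_config)


class Worker:
    """Pulls the elastic run configuration from the master before training."""

    def __init__(self, master, local_args):
        self.master = master
        self.args = dict(local_args)

    def setup(self):
        # Values distributed by the master override local defaults,
        # so every job in the cluster runs with the same settings.
        self.args.update(self.master.get_elastic_run_config())
        return self.args
```

For example, a worker launched with a local default of `network_check="false"` would end up with `network_check="true"` after `setup()`, because the master-provided value wins.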

Why are the changes needed?

This lets the master distribute the run configuration to workers, so it can be managed in one place instead of being repeated by every user.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

codecov[bot] commented 5 days ago

Codecov Report

Attention: Patch coverage is 96.47059% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.66%. Comparing base (97f39dc) to head (99a2482). Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dlrover/python/master/node/training_node.py | 76.92% | 3 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1277      +/-   ##
==========================================
+ Coverage   80.59%   80.66%   +0.06%
==========================================
  Files         219      219
  Lines       20058    20140      +82
==========================================
+ Hits        16165    16245      +80
- Misses       3893     3895       +2
```


workingloong commented 5 days ago

Does the PR support updating the training configurations without restarting the job?

samplise commented 4 days ago

> Does the PR support updating the training configurations without restarting the job?

No. This PR provides a simple way to give all jobs in a cluster the same configuration. Otherwise, every user would have to pass identical settings in their own dlrover-run command line.
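The trade-off being described can be sketched as follows. This is a hypothetical illustration, not DLRover code: the flag name and the `launch` helper are made up to contrast per-command flags with a single master-distributed config.

```python
# Without central distribution, every launch command must repeat the
# same flags, and a typo in one of them silently diverges that job:
per_user_commands = [
    "dlrover-run --network-check true train_a.py",
    "dlrover-run --network-check true train_b.py",
]

# With master-side distribution, the config is defined once for the
# cluster and each worker pulls it at startup.
cluster_config = {"network_check": "true"}

def launch(job_script, master_config):
    # Worker-side merge: settings from the master apply to every job,
    # so no per-user command line needs to carry them.
    args = {"script": job_script}
    args.update(master_config)
    return args

jobs = [launch(s, cluster_config) for s in ("train_a.py", "train_b.py")]
```

With this scheme, changing `cluster_config` in one place changes the settings seen by every subsequently launched job, which is the simplification the reply is pointing at.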