intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Fix heartbeat when a node is relaunched. #1200

Closed BalaBalaYi closed 4 months ago

BalaBalaYi commented 4 months ago

What changes were proposed in this pull request?

  1. When a node is relaunched, the new node object should reset its heartbeat to 0.
  2. Optimize the logging.
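
The first change can be illustrated with a minimal sketch. The `Node` class, its fields, and the `relaunch` and `is_heartbeat_timed_out` helpers below are hypothetical names chosen for illustration, not DLRover's actual API. The sketch assumes a heartbeat of 0 means "no heartbeat received yet", so a replacement node that starts at 0 is not judged against its predecessor's stale timestamp.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node record (illustration only, not DLRover's class)."""

    node_id: int
    rank: int
    heartbeat_time: float = 0.0  # assumption: 0 means no heartbeat received yet
    relaunch_count: int = 0


def relaunch(old: Node, new_id: int) -> Node:
    """Create the replacement node object with its heartbeat reset to 0."""
    return Node(
        node_id=new_id,
        rank=old.rank,
        heartbeat_time=0.0,  # do not carry over the old node's timestamp
        relaunch_count=old.relaunch_count + 1,
    )


def is_heartbeat_timed_out(node: Node, now: float, timeout: float) -> bool:
    """Report a timeout only for nodes that have sent at least one heartbeat."""
    if node.heartbeat_time == 0:
        return False
    return now - node.heartbeat_time > timeout


# The old node last reported at t=100; at t=1000 it is past a 300 s timeout.
old_node = Node(node_id=0, rank=0, heartbeat_time=100.0)
new_node = relaunch(old_node, new_id=1)
```

In this sketch, copying `heartbeat_time` from `old_node` would make `new_node` look timed out as soon as it is created; resetting it to 0 avoids that.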

Why are the changes needed?

Fixes the heartbeat issue that occurs when a node is relaunched.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests and a full training test.

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 79.97%. Comparing base (5d0c789) to head (22d5ee9).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1200      +/-   ##
==========================================
+ Coverage   79.95%   79.97%   +0.01%
==========================================
  Files         215      215
  Lines       19026    19040      +14
==========================================
+ Hits        15213    15227      +14
  Misses       3813     3813
```
