intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Fix heartbeat for concurrency. #1189

Closed BalaBalaYi closed 4 months ago

BalaBalaYi commented 4 months ago

What changes were proposed in this pull request?

  1. Add a lock around heartbeat collection and dead-node event detection.
  2. Add one more condition: 'heartbeat_time' is considered valid only when 'heartbeat_time' > 'start_time'.
  3. Remove a leftover 'print' call.
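
The first two changes can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `HeartbeatTracker` class, its method names, and the `timeout` parameter are hypothetical; only the names `heartbeat_time` and `start_time` come from the description above.

```python
import threading
import time


class HeartbeatTracker:
    """Hypothetical sketch: lock-protected heartbeat bookkeeping."""

    def __init__(self, timeout=300.0):
        self._lock = threading.Lock()
        self._timeout = timeout  # assumed: seconds without a heartbeat before a node counts as dead
        # node_id -> {"start_time": float, "heartbeat_time": float}
        self._nodes = {}

    def register(self, node_id, start_time):
        with self._lock:
            self._nodes[node_id] = {
                "start_time": start_time,
                "heartbeat_time": 0.0,
            }

    def collect_heartbeat(self, node_id, timestamp):
        # Writers take the same lock as the dead-node check below.
        with self._lock:
            node = self._nodes.get(node_id)
            if node is not None:
                node["heartbeat_time"] = timestamp

    def dead_nodes(self, now=None):
        now = time.time() if now is None else now
        dead = []
        with self._lock:
            for node_id, node in self._nodes.items():
                heartbeat_time = node["heartbeat_time"]
                # A heartbeat is only meaningful if it is newer than the
                # node's start time; older values are ignored.
                is_valid = heartbeat_time > node["start_time"]
                if is_valid and now - heartbeat_time > self._timeout:
                    dead.append(node_id)
        return dead
```

In this sketch a node whose recorded heartbeat predates its start time is never reported as dead, which is one way to read condition 2; the actual handling in the PR may differ.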

Why are the changes needed?

Multiple workers calling the master concurrently to update their heartbeats may cause thread-safety issues.
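
As a hedged illustration of the concern, the sketch below has several worker threads reporting heartbeats while a monitor thread reads them. The `heartbeats` dict and the `report`, `snapshot`, and `run_workers` functions are hypothetical stand-ins, not DLRover APIs. Without the lock, iterating or copying a dict while another thread inserts into it can raise `RuntimeError: dictionary changed size during iteration` in CPython.

```python
import threading

# Hypothetical shared store: key -> latest heartbeat timestamp.
heartbeats = {}
lock = threading.Lock()


def report(key, timestamp):
    # Called concurrently by many workers.
    with lock:
        heartbeats[key] = timestamp


def snapshot():
    # Copy under the lock so readers never observe a dict mid-update.
    with lock:
        return dict(heartbeats)


def run_workers(num_workers=8, beats_per_worker=200):
    def worker(worker_id):
        # Each report uses a distinct key so the dict keeps growing,
        # which is the situation that breaks unlocked iteration.
        for i in range(beats_per_worker):
            report((worker_id, i), float(i))

    stop = threading.Event()
    sizes = []

    def monitor():
        while not stop.is_set():
            sizes.append(len(snapshot()))

    monitor_thread = threading.Thread(target=monitor)
    workers = [
        threading.Thread(target=worker, args=(w,)) for w in range(num_workers)
    ]
    monitor_thread.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    stop.set()
    monitor_thread.join()
    return len(snapshot()), sizes
```

Calling `run_workers()` should finish without an exception and record every reported key; the lock serializes the writers against the monitor's copy.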

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests (TODO).

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 97.82609% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.96%. Comparing base (39a2cf7) to head (a5e3d4d). Report is 2 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/master/stats/stats_backend.py | 50.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1189      +/-   ##
==========================================
+ Coverage   79.90%   79.96%   +0.05%
==========================================
  Files         213      213
  Lines       18926    18959      +33
==========================================
+ Hits        15123    15160      +37
+ Misses       3803     3799       -4
```


majieyue commented 4 months ago

lgtm