intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System
Other
1.22k stars 153 forks source link

Sse rdzv timeout as insufficient timeout #1250

Closed BalaBalaYi closed 1 month ago

BalaBalaYi commented 1 month ago

What changes were proposed in this pull request?

  1. Worker report 'join timeout' by report_rdzv_params.
  2. Master saves the 'join timeout' in worker manager.
  3. Master base on the 'join timeout' to setup insufficient timeout.

For back compatibility(worker who don't report the join timeout): Use the default 600s as the 'join timeout'.

For protection: The timeout range is (600s, 3600s).

Why are the changes needed?

Previously, we used the pending timeout period as the criterion for determining insufficiency. However, considering that users may configure the networking timeout period themselves, it is more scientific to directly use the networking timeout period for this configuration.

The current definition of this timeout period is 1.5 times the networking timeout period, which is used to avoid scenarios where errors cannot be recovered within a complete round of the networking timeout period.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 90.62500% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.41%. Comparing base (211903e) to head (6f5984c). Report is 3 commits behind head on master.

Files Patch % Lines
dlrover/python/master/node/worker.py 85.71% 2 Missing :warning:
dlrover/python/master/servicer.py 66.66% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1250 +/- ## ========================================== - Coverage 80.41% 80.41% -0.01% ========================================== Files 218 218 Lines 19800 19820 +20 ========================================== + Hits 15923 15938 +15 - Misses 3877 3882 +5 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.