intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Refactor diagnose agent #1234

Closed. samplise closed this pull request 1 month ago.

samplise commented 1 month ago

What changes were proposed in this pull request?

Refactor the implementation of the diagnosis agent on the worker.

Why are the changes needed?

We have a new design for the diagnosis system.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 87.74704% with 31 lines in your changes missing coverage. Please review.

Project coverage is 80.39%. Comparing base (1ad45be) to head (d225739). Report is 14 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| dlrover/python/elastic_agent/torch/training.py | 47.36% | 10 Missing :warning: |
| .../python/elastic_agent/diagnosis/diagnosis_agent.py | 85.36% | 6 Missing :warning: |
| ...thon/diagnosis/datacollector/cuda_log_collector.py | 0.00% | 3 Missing :warning: |
| ...n/inferenceoperator/check_failure_node_operator.py | 91.42% | 3 Missing :warning: |
| dlrover/python/diagnosis/common/inference_chain.py | 87.50% | 2 Missing :warning: |
| ...ython/diagnosis/datacollector/metrics_collector.py | 0.00% | 2 Missing :warning: |
| .../diagnosis/datacollector/training_log_collector.py | 88.23% | 2 Missing :warning: |
| dlrover/python/common/worker.py | 93.33% | 1 Missing :warning: |
| dlrover/python/tests/test_diagnosis_agent.py | 97.43% | 1 Missing :warning: |
| dlrover/python/tests/test_inference_chain.py | 97.50% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1234      +/-   ##
==========================================
- Coverage   80.46%   80.39%   -0.08%
==========================================
  Files         214      218       +4
  Lines       19648    19751     +103
==========================================
+ Hits        15810    15879      +69
- Misses       3838     3872      +34
```
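The coverage figures in the diff can be reproduced from the hit and line counts. The sketch below is illustrative only: the helper names `coverage_pct` and `floor2` are hypothetical, and rounding down to two decimals is an assumption inferred from the reported numbers, not taken from this report.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Return covered lines as a percentage of all tracked lines."""
    return 100.0 * hits / lines


def floor2(value: float) -> float:
    """Round down to two decimals (assumed display rule, see lead-in)."""
    return math.floor(value * 100) / 100


# Hit and line counts copied from the coverage diff.
base_hits, base_lines = 15810, 19648   # master
head_hits, head_lines = 15879, 19751   # this pull request

base = coverage_pct(base_hits, base_lines)
head = coverage_pct(head_hits, head_lines)

# Misses are the tracked lines that were not hit.
base_misses = base_lines - base_hits
head_misses = head_lines - head_hits

print(f"base coverage: {floor2(base):.2f}%")
print(f"head coverage: {floor2(head):.2f}%")
print(f"change:        {floor2(head - base):+.2f}%")
print(f"misses:        {base_misses} -> {head_misses}")
```

Under that rounding assumption the script matches the report: coverage drops slightly because the added lines bring 34 new misses against 69 new hits, which is a lower hit ratio than the existing code base.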
