intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Optimize checkpointing. #1235

Closed BalaBalaYi closed 3 months ago

BalaBalaYi commented 3 months ago

What changes were proposed in this pull request?

  1. Add logging at more key points.
  2. Optimize existing logging.
  3. Modify annotations and variables.

Why are the changes needed?

Multiple optimizations for checkpointing.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests (UT).