intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Why can't the checkpoint be copied to shared memory asynchronously when using Flash Checkpoint? #1187

Closed Reflect0 closed 1 month ago

Reflect0 commented 2 months ago

When using Flash Checkpoint, I'm wondering why the checkpoint can't be copied to shared memory asynchronously. The checkpoint time could be reduced further if this feature were implemented.

workingloong commented 1 month ago

It is a good idea. However, a synchronous copy is much simpler to implement than an asynchronous one. What's more, the time to copy across nodes is small with a fast network.
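
To illustrate the simplicity trade-off, here is a minimal sketch in plain Python. It is not DLRover's implementation: `copy_sync`, `copy_async`, and `demo` are hypothetical names, and a `bytearray` stands in for the model state. The point is that a background copy only yields a consistent snapshot if the trainer waits (here via `join()`) before it mutates the state again, which is the extra coordination the synchronous path avoids.

```python
import threading
from multiprocessing import shared_memory


def copy_sync(state: bytearray, shm: shared_memory.SharedMemory) -> None:
    """Blocking copy: the caller resumes only after the snapshot is complete."""
    # The segment may be larger than requested, so slice to the state length.
    shm.buf[: len(state)] = state


def copy_async(state: bytearray, shm: shared_memory.SharedMemory) -> threading.Thread:
    """Background copy: the caller must join() before mutating `state` again."""
    worker = threading.Thread(target=copy_sync, args=(state, shm))
    worker.start()
    return worker


def demo() -> tuple[bytes, bytes]:
    # Hypothetical stand-in for a training state dict.
    state = bytearray(b"step-1" * 4)
    shm = shared_memory.SharedMemory(create=True, size=len(state))
    try:
        # Synchronous path: no extra coordination is needed.
        copy_sync(state, shm)
        sync_snapshot = bytes(shm.buf[: len(state)])

        # Asynchronous path: the next update must wait for the copy to finish,
        # otherwise the snapshot could mix bytes from two training steps.
        state[:] = b"step-2" * 4
        worker = copy_async(state, shm)
        worker.join()
        async_snapshot = bytes(shm.buf[: len(state)])
    finally:
        shm.close()
        shm.unlink()
    return sync_snapshot, async_snapshot
```

In a real trainer the wait would have to sit just before the next optimizer step, so the achievable overlap is limited to the part of the iteration that does not modify the parameters.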