microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Asynchronous Checkpointing #5721

Open zaptrem opened 4 months ago

zaptrem commented 4 months ago

Is your feature request related to a problem? Please describe.
Checkpointing is significantly faster with Torch Distributed's async checkpoint feature: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict_saver.async_save

Blog post: https://pytorch.org/blog/reducing-checkpointing-times/

We want to checkpoint frequently, but it is expensive because it blocks training.

Describe the solution you'd like
Checkpointing should first copy the params to CPU memory, then write the checkpoint to disk in the background while training continues.
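To make the request concrete, here is a minimal sketch of the pattern in plain Python. It is not DeepSpeed or PyTorch code: `AsyncCheckpointer` and `train_with_async_checkpoints` are hypothetical names, the "model state" is a toy dict, and `copy.deepcopy` stands in for the device-to-host copy that a real implementation would do on tensors. Only the snapshot blocks the training loop; the disk write happens on a background thread.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Snapshot state synchronously, then write it to disk on a background thread."""

    def __init__(self):
        self._writer = None  # thread handling the in-flight save, if any

    def save(self, state, path):
        # Allow only one in-flight save so snapshots do not pile up in memory.
        self.wait()
        # Blocking part: copy the state so later training steps cannot mutate
        # what is being written. With real tensors this would be a
        # device-to-host copy; deepcopy is an assumption standing in for it.
        snapshot = copy.deepcopy(state)
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file and rename it, so a crash mid-write never
        # leaves a truncated checkpoint at `path`.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the in-flight save (if any) has finished.
        if self._writer is not None:
            self._writer.join()
            self._writer = None


def train_with_async_checkpoints(num_steps, checkpoint_every, directory):
    # Toy stand-in for model and optimizer state.
    state = {"step": 0, "weights": [0.0, 0.0, 0.0]}
    checkpointer = AsyncCheckpointer()
    for step in range(1, num_steps + 1):
        state["step"] = step
        # Stand-in for an optimizer step; training keeps going during the write.
        state["weights"] = [w + 0.1 for w in state["weights"]]
        if step % checkpoint_every == 0:
            checkpointer.save(state, os.path.join(directory, f"step_{step}.pkl"))
    checkpointer.wait()  # make sure the last checkpoint is on disk before exiting
    return state
```

A real version would also need to surface write errors from the background thread and coordinate across ranks, which this sketch leaves out.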

Describe alternatives you've considered
Nebula is only for Azure users and is also kinda broken. Torch's FSDP appears to perform worse in general (both performance and accuracy) compared to DeepSpeed, likely due to differences in your mixed-precision implementations that I don't quite understand yet.

cailun01 commented 4 months ago

Will FastPersist be open-sourced in the next DeepSpeed release?

tjruwase commented 3 months ago

@zaptrem, thanks for this request. We currently lack bandwidth to add this feature, but it is noted.

tjruwase commented 3 months ago

> Will FastPersist be open-sourced in the next DeepSpeed release?

@cailun01, yes, we plan to open-source soon.

Irene-123 commented 3 months ago

Hi @tjruwase,
I went through the issue and I'm looking to contribute here, though I need some time for more clarification and understanding. I wanted to know if I can take this up as my first issue here. If you have any suggestions, let me know :) Thanks!

tjruwase commented 2 months ago

@Irene-123, you are welcome to give it a try. But I suspect this requires non-trivial effort and is probably not a good first issue.

@zaptrem, are you able to provide guidance on this?