Open · zaptrem opened this issue 4 months ago
Will FastPersist be open-sourced in the next DeepSpeed release?
@zaptrem, thanks for this request. We currently lack bandwidth to add this feature, but it is noted.
@cailun01, yes, we plan to open-source soon.
Hi @tjruwase
I went through the issue and am looking to contribute here, though I need some time for further clarification and understanding.
I wanted to know if I can take this up as my first issue here; if you have any suggestions, let me know :)
Thanks!
@Irene-123, you are welcome to give it a try. But I suspect this requires non-trivial effort and is probably not a good first issue.
@zaptrem, are you able to provide guidance on this?
**Is your feature request related to a problem? Please describe.**
Checkpointing is significantly faster with Torch Distributed's async checkpoint feature: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict_saver.async_save
Blog post: https://pytorch.org/blog/reducing-checkpointing-times/
We want to checkpoint frequently, but doing so is expensive because it blocks training.
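For reference, a minimal sketch of the Torch Distributed feature linked above (PyTorch >= 2.3); `model`, `optimizer`, and the checkpoint path are placeholders, and a distributed process group is assumed to be initialized already:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

def checkpoint_async(model, optimizer, step, prev_future=None):
    # Wait for the previous async save (if any) so only one write is in flight.
    if prev_future is not None:
        prev_future.result()

    # Gather sharded model/optimizer state into a saveable state dict.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    state_dict = {"model": model_sd, "optimizer": optim_sd}

    # async_save stages tensors to CPU, then writes in a background thread
    # and returns a Future, so the training loop can continue immediately.
    return dcp.async_save(state_dict, checkpoint_id=f"checkpoints/step_{step}")
```

The training loop would call this every N steps and only block on the returned future at the next save.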
**Describe the solution you'd like**
Checkpointing should first copy the parameters to CPU, then write the checkpoint to disk while training continues.
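Roughly, that flow might look like the sketch below; this is illustrative only (hypothetical helper name, plain single-process `torch.save`), not a proposal for DeepSpeed's internals:

```python
import threading
import torch

def save_checkpoint_async(model, optimizer, path):
    # Blocking but brief: snapshot tensors to CPU memory.
    cpu_state = {
        "model": {k: v.detach().to("cpu", copy=True)
                  for k, v in model.state_dict().items()},
        # Optimizer state may also hold GPU tensors; a real implementation
        # would offload those to CPU as well before handing off to the writer.
        "optimizer": optimizer.state_dict(),
    }

    # Non-blocking: serialize to disk off the training thread.
    writer = threading.Thread(target=torch.save, args=(cpu_state, path), daemon=True)
    writer.start()
    return writer  # join() before starting the next save or at shutdown
```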
**Describe alternatives you've considered**
Nebula is only available to Azure users and is also somewhat broken. Torch's FSDP appears to perform worse overall (in both speed and accuracy) than DeepSpeed, likely due to differences in your mixed-precision implementations that I don't fully understand yet.