microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] A minimal example to load universal checkpoint #6548

Open hongshanli23 opened 2 months ago

hongshanli23 commented 2 months ago

So far, I have not found a minimal example of resuming training from a universal checkpoint. The only example of using universal checkpoints is here, and it is buried under layers of Megatron code.

Describe the solution you'd like: A minimal example using only DeepSpeed and PyTorch, so that people using other frameworks such as PyTorch Lightning or Hugging Face can integrate this solution.
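For orientation, here is a rough sketch of the config side of such a flow. It is not taken from any existing example. It assumes the regular checkpoint has already been converted with DeepSpeed's `ds_to_universal.py` script, and that the `checkpoint.load_universal` config option is what enables loading the converted checkpoint. The helper name `make_universal_resume_config`, the optimizer settings, the ZeRO stage, and the paths are all placeholder choices for illustration.

```python
import copy
import json


def make_universal_resume_config(base_config: dict) -> dict:
    """Return a copy of a DeepSpeed config with universal checkpoint loading enabled.

    Hypothetical helper for illustration; it only edits a plain dict.
    """
    config = copy.deepcopy(base_config)
    # Assumption: "load_universal" under the "checkpoint" section tells
    # DeepSpeed to treat the checkpoint being loaded as a universal checkpoint.
    config.setdefault("checkpoint", {})["load_universal"] = True
    return config


# Placeholder training config; all values are illustrative only.
base_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

resume_config = make_universal_resume_config(base_config)
print(json.dumps(resume_config, indent=2))

# With DeepSpeed installed, the resume step would then look roughly like this
# (model, directory, and tag are placeholders; not executed here):
#
#   import deepspeed
#   engine, _, _, _ = deepspeed.initialize(
#       model=model,
#       model_parameters=model.parameters(),
#       config=resume_config,
#   )
#   engine.load_checkpoint("checkpoints", tag="<tag of the converted checkpoint>")
```

Keeping the original config untouched and deriving a separate resume config makes it easy to use the same base settings for both the first run and the resumed run.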

hongshanli23 commented 1 month ago

Here is my own minimal example, written after reading the relevant DeepSpeed code. I hope it is helpful to others: https://github.com/hongshanli23/DSUC

I would still like to wait for comments from the maintainers before closing this issue.

tjruwase commented 1 month ago

@hongshanli23, this is really great. Do you mind creating a PR in the following location? https://github.com/microsoft/DeepSpeedExamples/tree/master/training/universal_checkpoint

Thanks