Open hongshanli23 opened 2 months ago
Here is my own minimal example, written after reading the relevant DeepSpeed code. I hope it is helpful for others: https://github.com/hongshanli23/DSUC
I would still like to wait for comments from the maintainers before closing this issue.
@hongshanli23, this is really great. Do you mind creating a PR in the following location? https://github.com/microsoft/DeepSpeedExamples/tree/master/training/universal_checkpoint
Thanks
So far, I have not found a minimal example of resuming training from a universal checkpoint. The only example of using universal checkpoints is here, which is buried under layers of Megatron code.
Describe the solution you'd like
A minimal example that uses only DeepSpeed and PyTorch, so that people using other frameworks such as PyTorch Lightning or Hugging Face can integrate this solution.
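For anyone wondering what such a minimal example would need to configure, here is a rough sketch of the config side only. It assumes the `checkpoint.load_universal` flag is what switches DeepSpeed to universal-checkpoint loading; `make_ds_config` and the batch size, ZeRO stage, and bf16 values are hypothetical placeholders, not taken from the linked repository.

```python
import json


def make_ds_config(load_universal: bool) -> dict:
    """Build a small DeepSpeed-style config dict (hypothetical helper)."""
    return {
        # Placeholder training settings; adjust for your own run.
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": 1},
        "bf16": {"enabled": True},
        # Assumption: this flag makes the engine read a universal checkpoint
        # instead of a regular ZeRO checkpoint when loading.
        "checkpoint": {"load_universal": load_universal},
    }


# Config for the original run, which saves regular checkpoints.
fresh_cfg = make_ds_config(load_universal=False)

# Config for the resumed run, possibly on a different number of GPUs.
resume_cfg = make_ds_config(load_universal=True)

print(json.dumps(resume_cfg, indent=2))
```

As I understand it, the resumed run would pass `resume_cfg` to `deepspeed.initialize(..., config=resume_cfg)` and then call `engine.load_checkpoint(...)` on a checkpoint that was first converted with DeepSpeed's `ds_to_universal.py` script. Please check the exact steps against the DeepSpeed documentation.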