microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Example on saving experts to one model when using distributed training #178

Open Luodian opened 2 years ago

Luodian commented 2 years ago

Hi, thanks for providing such a wonderful codebase.

I have seen and used the save & load of MoE layers on multiple GPUs, so now I can save them on different ranks. But is there a way to convert them into one model?

Say I trained an 8-expert MoE on 8 GPUs, and now I want to run the next-stage inference on 1 GPU.

Would you consider providing an example of doing so? Or could you give some ideas on how to implement it myself?

ghostplant commented 2 years ago

A duplicate request of #177. We are going to add some utility functions to help with this conversion.
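In the meantime, a minimal sketch of the general idea (not the upcoming utility; the checkpoint file pattern and the `"experts"` key prefix are assumptions for illustration): load each rank's checkpoint, concatenate the sharded expert tensors along the expert dimension, and keep a single copy of the replicated (non-expert) parameters.

```python
import torch

world_size = 8  # number of ranks the model was trained on

# Load every rank's checkpoint (assumed here to be saved as "rank-<i>.pt").
shards = [torch.load(f"rank-{i}.pt", map_location="cpu") for i in range(world_size)]

merged = {}
for key in shards[0]:
    tensors = [shard[key] for shard in shards]
    if "experts" in key:  # assumed naming; adjust to your actual state_dict keys
        # Expert parameters are sharded one-per-rank; concatenate them along
        # the expert dimension so a single process holds all experts.
        merged[key] = torch.cat(tensors, dim=0)
    else:
        # Gate and shared parameters are replicated across ranks; keep one copy.
        merged[key] = tensors[0]

torch.save(merged, "merged_model.pt")
```

The merged state dict can then be loaded into a single-GPU model whose MoE layer is configured with all 8 local experts, provided the parameter shapes line up with how your experts were sharded.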

Luodian commented 2 years ago

Thanks! I think it's worth doing.