microsoft / Tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

How to implement Fairseq-MoE training checkpoint like Swin-MoE? #219

Open withinmiaov opened 11 months ago

withinmiaov commented 11 months ago

First, I want to thank the Tutel team for open-sourcing this work; it's a very good and practical framework. I want to use Tutel's MoE in Fairseq NLP tasks, but I ran into a problem: Fairseq's original checkpoint logic cannot save and load expert parameters that are distributed across different GPUs. How should I modify the Fairseq model to support checkpoints the way Swin-MoE does?

ghostplant commented 11 months ago

Hi, you may need to rename save_dir so that each per-device process saves to a unique destination:

https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/dataclass/configs.py#L645

You can change the default save_dir path to f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}" or f"checkpoints-dev{os.environ.get('RANK', 0)}", so that each rank writes its checkpoints (including its locally held expert parameters) to a separate directory.
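
For reference, a minimal sketch of what that change could look like in CheckpointConfig (the surrounding field layout is illustrative; only the save_dir default is the point here, and LOCAL_RANK/RANK are assumed to be set by the torch.distributed launcher):

```python
# Sketch of the suggested edit in fairseq/dataclass/configs.py (CheckpointConfig).
# Each process evaluates this default at import time, so every rank gets its
# own checkpoint directory, e.g. "checkpoints-dev0", "checkpoints-dev1", ...
import os
from dataclasses import dataclass, field


@dataclass
class CheckpointConfig:
    save_dir: str = field(
        # Prefer LOCAL_RANK; fall back to RANK, then 0 for single-process runs.
        default=f"checkpoints-dev{os.environ.get('LOCAL_RANK', os.environ.get('RANK', 0))}",
        metadata={"help": "path to save checkpoints"},
    )
```

Alternatively, you can leave configs.py untouched and pass a rank-specific directory on the command line, e.g. fairseq-train ... --save-dir "checkpoints-dev${LOCAL_RANK}", which has the same effect of giving each expert-holding process its own save/load location.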