withinmiaov opened 11 months ago
First, I want to thank the tutel team for open-sourcing this work; it is a very good and practical framework. I want to use tutel's MoE in fairseq NLP tasks, but I ran into a problem: fairseq's default checkpointing cannot save and load the expert parameters that are distributed across different GPUs. How should I modify the fairseq model to support checkpoints the way Swin-MoE does?

Hi, you may need to rename the save_dir so that each per-device process saves to a unique destination. You can change the default save_dir path to f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}" or f"checkpoints-dev{os.environ.get('RANK', 0)}".
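A minimal sketch of what that could look like in Python, assuming the save directory is set up before training starts; the variable names are only illustrative and nothing here is fairseq-specific:

```python
import os

# Hypothetical sketch: derive a unique checkpoint directory from the process rank,
# so each device's expert parameters are written to their own destination.
# Use LOCAL_RANK (rank within the node) or RANK (global rank), whichever fits your launcher.
rank = int(os.environ.get('LOCAL_RANK', 0))
save_dir = f"checkpoints-dev{rank}"
os.makedirs(save_dir, exist_ok=True)
print(f"Saving this process's checkpoints to: {save_dir}")
```

With this pattern each process writes to its own directory (checkpoints-dev0, checkpoints-dev1, ...), so the expert parameters sharded across GPUs are saved and restored per rank instead of overwriting one another.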