microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[BUG] Problems with Mixture-of-Experts (MoE) #367

Open nikit-srivastava opened 4 months ago

nikit-srivastava commented 4 months ago

Hello,

Thank you for the nice work on this training framework. However, I have noticed problems with inference, conversion, and fine-tuning of MoE-based GPT models. The following issues point to the same problems but have not yet been addressed:

In general, the inference example (generate_text.sh) does not work when --num-experts is set to a value greater than 1, and the conversion scripts (convert_checkpoint) are not equipped to handle MoE models. A rough repro sketch follows.
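To make the failure mode concrete, here is a minimal sketch of the kind of invocation that fails for me. The checkpoint path, vocabulary files, and model sizes are placeholders for our setup, and apart from --num-experts the launcher, script path (tools/generate_samples_gpt.py), and flags are my reading of the stock generate_text.sh example, so please treat the exact argument list as an assumption rather than a verified command:

```bash
#!/bin/bash
# Rough repro sketch: paths, vocab files, and model sizes are placeholders
# for our setup; the launcher, script path, and flags other than
# --num-experts follow the stock example as far as I can tell.
CHECKPOINT_PATH=checkpoints/gpt_moe   # placeholder MoE GPT checkpoint
VOCAB_FILE=gpt2-vocab.json            # placeholder
MERGE_FILE=gpt2-merges.txt            # placeholder

# With --num-experts 1 generation runs fine; any value > 1 fails.
deepspeed tools/generate_samples_gpt.py \
    --tensor-model-parallel-size 1 \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 1 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --load $CHECKPOINT_PATH \
    --fp16 \
    --num-experts 8 \
    --out-seq-length 256
```

The same checkpoint also cannot be processed by the convert_checkpoint tooling, as noted above.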

I would like to request the attention of the repository maintainers to this issue. This is a major roadblock in our research and prevents us from analyzing or publishing our findings. We would be really grateful if it could be resolved soon.

If you need any other information, or access to model weights for testing, please feel free to ask. I can also offer to fix or implement the missing features myself if you point me in the right direction.

suu990901 commented 2 months ago

Have you solved this problem?