Awesome project! Thank you for publishing it! I was just curious about the following:
Why does the SegMoE SD 4x2 model place Mixture of Experts (MoE) layers within its attention heads, while most other models, including the tutorial on Hugging Face (https://huggingface.co/blog/moe), typically place MoE layers in the feed-forward network (FFN)? What is the distinction between these two approaches?
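For context, here is a rough sketch of the two placements as I understand them. This is purely illustrative and not SegMoE's actual implementation; the names `SparseMoE`, `MoEAttention`, `num_experts`, and `top_k` are my own, and the experts are simplified to single linear layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Top-k gated mixture of linear experts (illustrative, not SegMoE's code)."""

    def __init__(self, dim, out_dim, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, out_dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (batch, tokens, dim)
        scores = self.gate(x)                            # (B, T, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens assigned to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


dim = 64

# (a) "Usual" placement, as in the Hugging Face blog post: the MoE block stands
#     in for the FFN of a transformer layer (real MoE FFNs use MLP experts).
moe_ffn = SparseMoE(dim, dim)


# (b) Attention placement, as I understand SegMoE does it: the q/k/v/out
#     projections of the attention block are themselves routed experts.
class MoEAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = SparseMoE(dim, dim)
        self.to_k = SparseMoE(dim, dim)
        self.to_v = SparseMoE(dim, dim)
        self.to_out = SparseMoE(dim, dim)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = (p(x).view(B, T, self.heads, -1).transpose(1, 2)
                   for p in (self.to_q, self.to_k, self.to_v))
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(attn.transpose(1, 2).reshape(B, T, D))


x = torch.randn(2, 16, dim)
print(moe_ffn(x).shape, MoEAttention(dim)(x).shape)  # both (2, 16, 64)
```

Is (b) roughly what SegMoE is doing, and if so, what motivated routing the attention projections instead of (or in addition to) the FFN?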