Awesome project! Thank you for publishing it! I was just curious about the following:
Why does the SegMoE SD 4x2 model place Mixture of Experts (MoE) layers within its attention heads, while most other models, including the tutorial on Hugging Face (https://huggingface.co/blog/moe), typically place MoE layers in the feed-forward network (FFN)? What is the distinction between these two approaches?
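For context, here is a rough sketch of the two placements as I understand them. This is purely illustrative and not SegMoE's actual implementation; the names `SparseMoE`, `MoEAttention`, `num_experts`, and `top_k` are my own, and the experts are simplified to single linear layers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Top-k gated mixture of linear experts (illustrative, not SegMoE's code)."""

    def __init__(self, dim, out_dim, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, out_dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (batch, tokens, dim)
        scores = self.gate(x)                            # (B, T, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens assigned to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


dim = 64

# (a) "Usual" placement, as in the Hugging Face blog post: the MoE block stands
#     in for the FFN of a transformer layer (real MoE FFNs use MLP experts).
moe_ffn = SparseMoE(dim, dim)


# (b) Attention placement, as I understand SegMoE does it: the q/k/v/out
#     projections of the attention block are themselves routed experts.
class MoEAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = SparseMoE(dim, dim)
        self.to_k = SparseMoE(dim, dim)
        self.to_v = SparseMoE(dim, dim)
        self.to_out = SparseMoE(dim, dim)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = (p(x).view(B, T, self.heads, -1).transpose(1, 2)
                   for p in (self.to_q, self.to_k, self.to_v))
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(attn.transpose(1, 2).reshape(B, T, D))


x = torch.randn(2, 16, dim)
print(moe_ffn(x).shape, MoEAttention(dim)(x).shape)  # both (2, 16, 64)
```

Is (b) roughly what SegMoE is doing, and if so, what motivated routing the attention projections instead of (or in addition to) the FFN?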