segmind / segmoe


MoE in the attn heads #25

Open nwaftp23 opened 6 months ago

nwaftp23 commented 6 months ago

Awesome project! Thank you for publishing it! I was just curious about the following:

Why does the SegMoE SD 4x2 model place Mixture of Experts (MoE) layers within its attention heads, while most other models, including the Hugging Face tutorial (https://huggingface.co/blog/moe), typically place MoE layers in the feed-forward network (FFN)? What is the distinction between these two approaches?
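To make the distinction concrete, here is a minimal, hypothetical sketch (not SegMoE's actual code): the same sparse top-k routing layer can replace either a transformer block's FFN or its attention projections (`to_q`/`to_k`/`to_v`). The `SparseMoE` class, dimensions, and expert count below are illustrative assumptions only.

```python
# Hypothetical sketch, not SegMoE's implementation: contrast MoE-in-FFN vs MoE-in-attention.
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    """Routes each token to the top-k of `num_experts` identically shaped linear experts."""

    def __init__(self, dim_in, dim_out, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim_in, dim_out) for _ in range(num_experts)])
        self.gate = nn.Linear(dim_in, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (batch, tokens, dim_in)
        # Per-token gating: pick the top-k experts and their (softmaxed) weights.
        weights, idx = self.gate(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features, device=x.device)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e  # tokens routed to expert `e` in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


dim = 64
x = torch.randn(1, 16, dim)

# Placement A (the Hugging Face MoE tutorial): MoE replaces the block's FFN.
ffn_moe = nn.Sequential(SparseMoE(dim, 4 * dim), nn.GELU(), SparseMoE(4 * dim, dim))

# Placement B (what the question describes for SegMoE SD 4x2): MoE replaces the
# attention projections, so routing happens on q/k/v instead of inside the FFN.
to_q, to_k, to_v = SparseMoE(dim, dim), SparseMoE(dim, dim), SparseMoE(dim, dim)
q, k, v = to_q(x), to_k(x), to_v(x)
attn_out = nn.functional.scaled_dot_product_attention(q, k, v)

print(ffn_moe(x).shape, attn_out.shape)  # both preserve the (1, 16, 64) token shape
```

In both placements the router selects experts per token; the only difference in this sketch is which linear layers of the block are made expert-parallel.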