mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Gate is Linear Layer?!?! #112

Open Eran-BA opened 6 months ago

Eran-BA commented 6 months ago

I have two fundamental questions regarding your code in the repository: https://github.com/mistralai/mistral-src/tree/main/mistral/model.py

  1. You implemented the gate as a linear layer, which doesn't make sense to me. To decide which expert should process each token, shouldn't the gate itself be some kind of transformer (a Switch Transformer, maybe?) rather than a plain linear layer?

  2. Secondly, you don't use GPT-style models as the experts, just regular linear layers.

Where is the full code?

lhallee commented 4 months ago

Hi! I have nothing to do with Mistral, but I can answer your questions.

Gates (routers) are always linear layers, even in Switch Transformers.
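To make that concrete, here is a minimal sketch of such a gate in PyTorch. The names (`LinearGate`, `num_experts`, `top_k`) are my own for illustration and not taken from Mistral's code; the point is only that the entire router is one `nn.Linear` mapping each token's hidden state to a logit per expert, followed by a top-k selection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearGate(nn.Module):
    """Minimal top-k router: the whole 'gate' is one linear projection."""
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.proj = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim) -> one logit per expert for each token
        logits = self.proj(x)                         # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over the chosen experts
        return weights, indices
```

That is essentially all the routing logic there is: the "decision" comes from the logits of a single projection, trained end to end with the rest of the network.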

The experts themselves are nearly always regular MLPs (sets of feed-forward layers), not full models. Some architectures also put attention inside the experts, or use separate experts for attention, but usually it is just MLP experts with shared attention.
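Continuing the sketch above (again a simplified illustration of the general pattern, not Mistral's actual implementation; it reuses the imports and `LinearGate` from the previous snippet), this is roughly how MLP experts plug into a sparse MoE feed-forward block, with attention shared and living outside it:

```python
# Reuses torch / nn / F imports and LinearGate from the sketch above.

class MLPExpert(nn.Module):
    """One expert: an ordinary feed-forward block, not a full GPT."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))

class MoEFeedForward(nn.Module):
    """Sparse MoE feed-forward layer; each token only runs through its top-k experts."""
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = LinearGate(hidden_dim, num_experts, top_k)
        self.experts = nn.ModuleList(MLPExpert(hidden_dim, ffn_dim) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        weights, indices = self.gate(x)
        out = torch.zeros_like(x)
        for k in range(indices.shape[-1]):        # each of the top-k slots
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```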

There is an implementation of Mixtral (and regular Mistral) by Hugging Face: https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/models/mixtral/modeling_mixtral.py Hope this helps!