mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Gate is Linear Layer?!?! #112

Open Eran-BA opened 6 months ago

Eran-BA commented 6 months ago

I have two fundamental questions regarding your code in the repository: https://github.com/mistralai/mistral-src/tree/main/mistral/model.py

  1. You implemented the gate as a linear layer, which doesn't make sense to me. To decide which expert should process each token, shouldn't the gate itself be some kind of transformer (a Switch Transformer, maybe?) rather than a plain linear layer?

  2. Secondly, you don't use GPT-style models as the experts, just regular linear layers.

Where is the full code?

lhallee commented 4 months ago

Hi! I have nothing to do with Mistral, but I can answer your questions.

Gates (routers) are always linear layers, even in Switch Transformers.
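To make that concrete, here is a minimal sketch of such a gate in PyTorch. The names (`LinearGate`, `num_experts`, `top_k`) are my own for illustration and not taken from Mistral's code; the point is only that the entire router is one `nn.Linear` mapping each token's hidden state to a logit per expert, followed by a top-k selection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearGate(nn.Module):
    """Minimal top-k router: the whole 'gate' is one linear projection."""
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.proj = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim) -> one logit per expert for each token
        logits = self.proj(x)                         # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over the chosen experts
        return weights, indices
```

That is essentially all the routing logic there is: the "decision" comes from the logits of a single projection, trained end to end with the rest of the network.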

The experts themselves are nearly always regular MLPs (sets of feed-forward layers), not full models. Some architectures also put attention inside the experts, or use separate experts for attention, but usually it is just MLP experts with shared attention.
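Continuing the sketch above (again a simplified illustration of the general pattern, not Mistral's actual implementation; it reuses the imports and `LinearGate` from the previous snippet), this is roughly how MLP experts plug into a sparse MoE feed-forward block, with attention shared and living outside it:

```python
# Reuses torch / nn / F imports and LinearGate from the sketch above.

class MLPExpert(nn.Module):
    """One expert: an ordinary feed-forward block, not a full GPT."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))

class MoEFeedForward(nn.Module):
    """Sparse MoE feed-forward layer; each token only runs through its top-k experts."""
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = LinearGate(hidden_dim, num_experts, top_k)
        self.experts = nn.ModuleList(MLPExpert(hidden_dim, ffn_dim) for _ in range(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        weights, indices = self.gate(x)
        out = torch.zeros_like(x)
        for k in range(indices.shape[-1]):        # each of the top-k slots
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```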

There is an implementation of Mixtral (and regular Mistral) by Hugging Face: https://github.com/huggingface/transformers/blob/v4.38.2/src/transformers/models/mixtral/modeling_mixtral.py Hope this helps!