pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
https://arxiv.org/abs/2406.16554
Apache License 2.0

Partition FFNs without downsizing them? #62

Closed · abhinand5 closed this 6 months ago

abhinand5 commented 7 months ago

Here you convert the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The number of parameters would increase, but that should be fine since we are going to continually pretrain the model anyway.

One way to achieve this is to duplicate the FFN into multiple experts and add a gate on top. I know it is not ideal for the experts to start with identical weights, so maybe we could add a bit of random noise to the weights before pretraining?
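As a minimal sketch of the idea (not this repo's implementation; `dense_mlp`, `num_experts`, `hidden_size`, and `noise_std` are illustrative names, assuming a PyTorch-style LLaMA MLP that maps `hidden_size -> hidden_size`):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    """Duplicate a full-size dense FFN into N experts, perturb the copies
    slightly, and route tokens with a simple top-1 gate."""

    def __init__(self, dense_mlp: nn.Module, num_experts: int = 4,
                 hidden_size: int = 4096, noise_std: float = 1e-3):
        super().__init__()
        # Every expert starts as an exact copy of the original FFN,
        # so the original dimension is preserved.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )
        # Add a small amount of Gaussian noise so the experts can diverge
        # during continual pretraining.
        with torch.no_grad():
            for expert in self.experts:
                for p in expert.parameters():
                    p.add_(torch.randn_like(p) * noise_std)
        # Linear gate that scores each expert per token.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)   # (b, s, num_experts)
        top1 = scores.argmax(dim=-1)               # (b, s)
        out = torch.zeros_like(x)
        # Naive dense dispatch for clarity: every expert sees every token,
        # and the mask keeps only the tokens routed to it.
        for i, expert in enumerate(self.experts):
            mask = (top1 == i).unsqueeze(-1)
            out = out + mask * expert(x) * scores[..., i:i + 1]
        return out
```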

Let me know your thoughts!

Spico197 commented 7 months ago

Great idea~ This is similar to Sparse Upcycling, a method proposed by Google that expands a dense FFN into multiple experts.

https://arxiv.org/abs/2212.05055