Here you are converting the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The parameter count would grow, but that should be fine since we are going to continually pretrain the model anyway.
One way to achieve this is to duplicate the FFN into multiple experts and add a gate on top. I know it's not ideal for all the experts to start with identical weights, so maybe we could add a bit of random noise to each copy before the continual pretraining?
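To make the idea concrete, here is a minimal PyTorch sketch of what I mean, not a definitive implementation. It uses a stand-in SwiGLU block with LLaMA-style `gate_proj` / `up_proj` / `down_proj` layers (in practice you would `deepcopy` the actual `LlamaMLP` from the checkpoint); names like `DuplicatedMoE`, `num_experts`, `top_k`, and `noise_std` are just illustrative, not from any library:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    """Stand-in for LLaMA's FFN block (gate/up/down projections with SiLU)."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class DuplicatedMoE(nn.Module):
    """Top-k MoE layer whose experts are noisy copies of one dense FFN."""
    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int = 4, top_k: int = 2, noise_std: float = 1e-3):
        super().__init__()
        self.top_k = top_k
        # Duplicate the original FFN and perturb each copy so experts can diverge.
        self.experts = nn.ModuleList()
        for _ in range(num_experts):
            expert = copy.deepcopy(dense_ffn)
            with torch.no_grad():
                for p in expert.parameters():
                    p.add_(torch.randn_like(p) * noise_std)
            self.experts.append(expert)
        # The router ("gate") is new and trained from scratch.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):
        # x: (batch, seq, hidden)
        logits = self.gate(x)                                  # (B, S, E)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # (B, S, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Naive dense loop for readability; real MoE kernels dispatch
        # only the tokens routed to each expert.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                                  # (B, S, k)
            if mask.any():
                gate_w = (weights * mask).sum(dim=-1, keepdim=True)  # (B, S, 1)
                out = out + gate_w * expert(x)
        return out


# Usage with LLaMA-7B-like dimensions (hypothetical values):
dense = SwiGLUMLP(hidden_size=4096, intermediate_size=11008)
moe = DuplicatedMoE(dense, hidden_size=4096, num_experts=4, top_k=2)
y = moe(torch.randn(1, 8, 4096))
```

The routing here is ordinary top-k gating with a softmax over the selected logits; the per-expert loop is just for clarity, and the noise is only there to break the symmetry between the duplicated experts before continual pretraining.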
Let me know your thoughts!