Closed: DoubleVII closed this issue 10 months ago.
Hi there, thanks for the question~
- For the LLaMA2-7B model's performance, please refer to the Sheared LLaMA and original LLaMA2 papers for detailed metric scores.
For a 3B model, it is very difficult to reach the original 7B dense model's performance, and doing so would require an enormous number of training tokens.
- It may be possible, but the cost would not be negligible.
- Alternatively, following the scaling law of MoE, we could add more experts via upcycling, which keeps the number of activated parameters unchanged while scaling up the total number of parameters. However, the total parameter count might then exceed 7B (see the sketch below).
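To illustrate the upcycling idea, here is a minimal, hypothetical PyTorch sketch (not the actual LLaMA-MoE implementation): a pretrained dense FFN is copied into several experts behind a top-k router, so the parameters activated per token stay roughly constant while the total parameter count grows with the number of experts. The class and argument names (`UpcycledMoE`, `num_experts`, `top_k`) are purely illustrative.

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoE(nn.Module):
    """Hypothetical sketch of sparse upcycling: replicate a pretrained dense
    FFN into several experts behind a top-k router. Per-token activated
    parameters stay close to the dense FFN, total parameters scale with
    num_experts."""

    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert starts as an identical copy of the pretrained dense FFN.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # The router scores each token against every expert.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = self.router(x)                         # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The expert copies start identical and only diverge through continued training, which is exactly the non-negligible cost mentioned above.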
Thank you for your reply addressing my concern; I will close this issue.
I appreciate the authors' contribution to MoE construction. The authors mention in Section 1 that there is a significant performance drop between the LLaMA-MoE-v1 models and the original dense LLaMA models. I am curious about the performance of the original dense LLaMA models and whether this gap can be bridged by the approach proposed in the paper.