pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://arxiv.org/abs/2406.16554
Apache License 2.0

Performance comparison between LLaMA-MoE and the original dense model. #49

Closed DoubleVII closed 10 months ago

DoubleVII commented 10 months ago

I appreciate the authors' contribution to MoE construction. The authors mention in Section 1 that there is a significant performance drop between the LLaMA-MoE-v1 models and the original dense LLaMA models. I am curious about the performance of the original dense LLaMA model and whether these gaps can be bridged by the approach proposed in the paper.

Spico197 commented 10 months ago

Hi there, thanks for the question~


  • For the LLaMA2-7B model's performance, please refer to the Sheared LLaMA and original LLaMA2 papers for detailed metric scores.
  • For a 3B model, it is really difficult to reach the original 7B dense model's performance, and doing so would require an enormous number of training tokens.

    • It may be possible, but the cost is not negligible.
    • Alternatively, following the MoE scaling law, we could add more experts via upcycling, which keeps the number of activated parameters unchanged while scaling up the total parameter count, as sketched below. However, the total number of parameters may then exceed 7B.
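To make the activated-vs-total parameter distinction concrete, here is a minimal sketch of upcycling, not the repository's implementation: the dense FFN is copied into N experts, so the total parameter count grows with N while top-k routing keeps the activated count per token fixed. The layer sizes, expert count, and the simplified two-matrix FFN (the real LLaMA FFN is a gated SwiGLU MLP) are illustrative assumptions.

```python
# Minimal upcycling sketch (assumptions, not LLaMA-MoE's actual code).
import copy
import torch.nn as nn

hidden, intermediate = 4096, 11008   # assumed 7B-like FFN sizes
num_experts, top_k = 8, 2            # hypothetical MoE setting

# Simplified two-matrix FFN for brevity.
dense_ffn = nn.Sequential(
    nn.Linear(hidden, intermediate, bias=False),
    nn.SiLU(),
    nn.Linear(intermediate, hidden, bias=False),
)

# Upcycling: each expert starts as a copy of the dense FFN.
experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))

ffn_params = sum(p.numel() for p in dense_ffn.parameters())
total_params = num_experts * ffn_params   # grows with the expert count
activated_params = top_k * ffn_params     # fixed per token by the top-k router

print(f"per-expert FFN params: {ffn_params / 1e6:.1f}M")
print(f"total expert params:   {total_params / 1e6:.1f}M")
print(f"activated per token:   {activated_params / 1e6:.1f}M")
```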

DoubleVII commented 10 months ago

Thank you for the reply addressing my concern; I will close this issue.