Any experiments about the load balancing loss?

pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)

https://arxiv.org/abs/2406.16554

Apache License 2.0

Closed exhyy closed 1 month ago

exhyy commented 1 month ago

Hi Authors! Thanks for your excellent work!

I have a few questions about this paper:

Was the result in Table 2 obtained using Independent-Random?
I am surprised that Independent-Random achieves the best results. I am wondering whether there is any relationship between these results and the load balancing loss. Do you have any experiments on this? For example, using the random hashing layer from MoEBERT rather than a trainable router with a load balancing loss.

exhyy commented 1 month ago

@Spico197 @DaizeDong

Spico197 commented 1 month ago

Hi there, thanks for your attention to this project~

Yes, the best results are obtained with Independent-Random.
Since MoE-fying dense models is not a common setting for MoE training, we kindly refer you to Fig. 2 in Wang et al. (2024), which demonstrates that greater $\alpha$ values (lower balance loss) result in worse perplexities.
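
For reference, a load balancing auxiliary loss of the kind discussed here can be sketched as follows. This is a minimal illustration in the style of the Switch Transformer loss for top-1 routing, not the LLaMA-MoE code. It assumes that $\alpha$ is the weight on the auxiliary term, which this thread does not spell out; the function names and the default `alpha` value are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_logits: list of per-token lists, one logit per expert.
    alpha: assumed weight on the auxiliary term (hypothetical default).
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    # alpha * N * sum_i f_i * P_i; equals alpha when routing is uniform.
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Four tokens, four experts: each token strongly prefers a different expert.
balanced = [[10.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
# Four tokens that all strongly prefer expert 0.
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
```

In this sketch the collapsed routing yields a loss close to four times that of the balanced routing, so a larger `alpha` pushes the router harder toward an even token split. A fixed random hashing layer, by contrast, has no trainable router and therefore no such term to tune.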

exhyy commented 1 month ago

I see. Thanks for your explanations!