Was the result in Table 2 obtained using Independent-Random?
I am surprised that Independent-Random achieves the best results. I am wondering whether there is any relationship between the results and the load balancing loss. Do you have any experiments on this? For example, using the random hashing layer in MoEBERT rather than a trainable router with load balancing loss.
Hi there, thanks for your attention to this project~
Yes, the best results are obtained with Independent-Random.
Since MoE-fying dense models is not a common setting for MoE training, we kindly refer you to Fig. 2 in Wang et al. (2024), which shows that greater $\alpha$ values (lower balance loss) result in worse perplexities.
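For readers less familiar with the two routing options being compared, here is a minimal sketch. It assumes top-1 routing and uses one common form of the auxiliary loss (the Switch Transformer style `N * sum(f_i * P_i)`), which is not necessarily the exact loss used in this project or in Wang et al. (2024). The function names `load_balancing_loss`, `make_hash_table`, and `hash_route` are hypothetical and only for illustration; `hash_route` is a generic fixed random lookup, not MoEBERT's actual implementation.

```python
import numpy as np

def load_balancing_loss(router_logits):
    """Illustrative Switch-style auxiliary loss for top-1 routing.

    router_logits: array of shape (num_tokens, num_experts).
    Returns about 1.0 when tokens are spread evenly and approaches
    num_experts when every token is sent to the same expert.
    """
    num_tokens, num_experts = router_logits.shape
    # Numerically stable softmax over the expert dimension.
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    # f_i: fraction of tokens whose top-1 choice is expert i.
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=num_experts) / num_tokens
    # P_i: mean router probability assigned to expert i.
    p = probs.mean(axis=0)
    return num_experts * float(np.sum(f * p))

def make_hash_table(vocab_size, num_experts, seed=0):
    """Fixed random token-id -> expert-id table (assumption: no training)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_experts, size=vocab_size)

def hash_route(token_ids, table):
    """Route each token by table lookup; there is no router and no balance loss."""
    return table[np.asarray(token_ids)]
```

With a trainable router, the training objective would typically be `lm_loss + alpha * load_balancing_loss(router_logits)`, so $\alpha$ controls how strongly balance is enforced. With the hash table, the expert assignment is fixed before training and the auxiliary term is not needed.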