lucidrains / mixture-of-experts

A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
MIT License

Load balancing loss? #10

Closed Aman-Goel1 closed 11 months ago

Aman-Goel1 commented 11 months ago

Hi lucidrains, thanks for the amazing repository. I was wondering where the load balancing loss is? I recall there being two losses, the auxiliary loss as well as the load balancing loss, in the 2017 mixture-of-experts paper.

lucidrains commented 11 months ago

@Aman-Goel1 hey Aman, as far as I know, the load balancing loss is an auxiliary loss, and is defined here
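
For reference, a minimal PyTorch sketch of a GShard/Switch-style auxiliary load-balancing loss. The function name, the top-1 routing assumption, and the tensor shapes are my own illustration here, not necessarily how this repo implements it:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts):
    # router_logits: (num_tokens, num_experts) raw gating scores
    # expert_indices: (num_tokens,) top-1 expert chosen per token (assumed top-1 routing)
    probs = router_logits.softmax(dim=-1)

    # f_e: fraction of tokens actually dispatched to each expert (hard counts)
    dispatch = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)

    # P_e: mean router probability assigned to each expert (soft)
    prob_per_expert = probs.mean(dim=0)

    # GShard/Switch-style auxiliary loss: N * sum_e f_e * P_e
    # (minimized when routing is uniform across experts)
    return num_experts * (tokens_per_expert * prob_per_expert).sum()
```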

if you are carrying out new research, I'd recommend using ST-MoE, which is more complete and up to date. it contains a new router z-loss that helps a lot with stability
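
For illustration, here is a minimal sketch of the router z-loss as described in the ST-MoE paper; the function name and shapes are assumptions on my part, not the st-moe-pytorch API:

```python
import torch

def router_z_loss(router_logits):
    # router_logits: (num_tokens, num_experts) raw gating scores
    # ST-MoE z-loss: mean over tokens of (logsumexp over experts)^2,
    # which penalizes large router logits and improves training stability
    z = torch.logsumexp(router_logits, dim=-1)
    return (z ** 2).mean()
```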

Aman-Goel1 commented 11 months ago

@lucidrains Thanks for the swift response. I thought this was an implementation of Shazeer et al.'s 2017 MoE, but it's GShard's MoE. That was a confusion on my part.

Also thanks for the ST-MoE repository, I'll be most probably using that instead!