lucidrains / mixture-of-experts

A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
MIT License

PEER implementation #11

Closed · huu4ontocord closed this 3 months ago

huu4ontocord commented 3 months ago

Hi @lucidrains, would you consider implementing the new DeepMind 1M-expert MoE?

Mixture of A Million Experts

Abstract: The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.

https://arxiv.org/pdf/2407.04153
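
To make the retrieval step concrete, here is a minimal PyTorch sketch of the product-key lookup that PEER builds on: split the query in half, score each half against a small sub-key table, and merge the two top-k lists so only k*k candidates out of the implicit n*n key grid are ever scored. This is just an illustration, not the paper's code or anything in this repo; the function name, shapes, and sizes are assumptions.

```python
# Illustrative sketch of product-key retrieval (not from the paper or this repo).
import torch

def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    # query: (batch, dim), sub_keys_*: (n, dim // 2)
    q1, q2 = query.chunk(2, dim=-1)

    # score each query half against its own sub-key table
    s1, i1 = (q1 @ sub_keys_1.t()).topk(k, dim=-1)   # (batch, k) scores / indices
    s2, i2 = (q2 @ sub_keys_2.t()).topk(k, dim=-1)

    n = sub_keys_2.shape[0]
    # combine the two top-k lists: k*k candidates out of the implicit n*n keys
    cand_scores = (s1.unsqueeze(-1) + s2.unsqueeze(-2)).flatten(-2)   # (batch, k*k)
    cand_idx = (i1.unsqueeze(-1) * n + i2.unsqueeze(-2)).flatten(-2)  # ids in [0, n*n)

    scores, pos = cand_scores.topk(k, dim=-1)
    return scores, cand_idx.gather(-1, pos)

# 1024 sub-keys per table -> 1024**2 (~1M) addressable experts, but only
# 2 * 1024 dot products plus a k*k merge per query
q = torch.randn(4, 256)
keys_1, keys_2 = torch.randn(1024, 128), torch.randn(1024, 128)
scores, expert_ids = product_key_topk(q, keys_1, keys_2, k=16)
print(expert_ids.shape)   # torch.Size([4, 16])
```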

From my understanding, with some added complexity in the retrieval step, you can replace the FFN with a PEER layer while keeping the parameter count and FLOPs the same, and still improve performance.
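
As a rough picture of that drop-in swap, below is a self-contained, hedged sketch of a PEER-style layer (the retrieval step from the previous snippet is repeated inline), where each expert is a single-neuron MLP whose down/up weight vectors are rows of two embedding tables. The class name, initialization, and sizes are illustrative assumptions, not the paper's exact configuration or an API of this repo.

```python
import torch
from torch import nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    """Hypothetical PEER-style FFN replacement; all sizes are illustrative."""
    def __init__(self, dim, n_sub_keys=1024, topk=16):
        super().__init__()
        assert dim % 2 == 0
        self.topk = topk
        self.num_experts = n_sub_keys ** 2           # e.g. 1024**2 ~= 1M tiny experts

        self.to_query = nn.Linear(dim, dim)
        self.sub_keys_1 = nn.Parameter(torch.randn(n_sub_keys, dim // 2) * 0.02)
        self.sub_keys_2 = nn.Parameter(torch.randn(n_sub_keys, dim // 2) * 0.02)

        # each expert i is a single-neuron MLP: x -> gelu(x . down_i) * up_i
        self.expert_down = nn.Embedding(self.num_experts, dim)
        self.expert_up = nn.Embedding(self.num_experts, dim)

    def forward(self, x):                            # x: (batch, seq, dim)
        b, s, d = x.shape
        flat = x.reshape(b * s, d)
        k, n = self.topk, self.sub_keys_1.shape[0]

        # product-key retrieval: score each query half against its sub-key table,
        # then merge the two top-k lists instead of scoring all n*n experts
        q1, q2 = self.to_query(flat).chunk(2, dim=-1)
        s1, i1 = (q1 @ self.sub_keys_1.t()).topk(k, dim=-1)   # (b*s, k)
        s2, i2 = (q2 @ self.sub_keys_2.t()).topk(k, dim=-1)

        cand_scores = (s1.unsqueeze(-1) + s2.unsqueeze(-2)).flatten(-2)   # (b*s, k*k)
        cand_idx = (i1.unsqueeze(-1) * n + i2.unsqueeze(-2)).flatten(-2)
        scores, pos = cand_scores.topk(k, dim=-1)
        idx = cand_idx.gather(-1, pos)                # selected expert ids, (b*s, k)

        gates = scores.softmax(dim=-1)
        down = self.expert_down(idx)                  # (b*s, k, dim)
        up = self.expert_up(idx)                      # (b*s, k, dim)

        # each selected expert produces a scalar activation, gated and projected back up
        hidden = F.gelu(torch.einsum('nd,nkd->nk', flat, down)) * gates
        out = torch.einsum('nk,nkd->nd', hidden, up)
        return out.reshape(b, s, d)

# quick shape check with small illustrative sizes
layer = PEERSketch(dim=256, n_sub_keys=128, topk=8)   # 128**2 = 16,384 experts
x = torch.randn(2, 10, 256)
print(layer(x).shape)    # torch.Size([2, 10, 256])
```

In this sketch the per-token compute depends only on topk and dim, while the parameter count scales with the number of experts, which is the decoupling the abstract describes.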

lucidrains commented 3 months ago

@huu4ontocord, would you like to pitch this on the Discord?