Model description
Mixture of Tokens (MoT) is a new technique proposed in the paper Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation and the accompanying blog post by Szymon Antoniak, Sebastian Jaszczur, et al.
It builds on expert-choice MoE but aggregates tokens across sequences in a batch rather than across positions within a sequence, and does so in a continuous fashion. This full differentiability is its main advantage, bringing training stability and even expert utilization.
In collaboration with the authors, we (myself and three others) would like to add a PyTorch implementation matching the architecture from the paper to HF transformers, and later publish corresponding checkpoints. We believe this will make it significantly easier for the community to experiment with this approach, as the original implementation is quite dense and lives in an active research repo.
We believe a good approach is to start from the HF GPT-2 model. The original authors will assist us in making sure the details match.
Please advise:
- Any general suggestions you have at this stage.
- What kinds of tests you would like to see in the finalized implementation, given that neither the exact code snapshot corresponding to the paper's implementation nor the checkpoints were previously published.
- Any general suggestions regarding contributing methods that are potentially applicable to multiple base models (like MoE and MoT).
As we understand it, the next step is for us to create a template with https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model and get coding.
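To make the batch-wise, continuous aggregation concrete, here is a minimal PyTorch sketch of how such a layer could look. This is our reading of the mechanism from the paper and blog post, not the authors' code: all names (`MixtureOfTokens`, `controller`, `n_experts`) and details such as the activation function are our own assumptions. Tokens at the same position across the batch form a group; a controller produces per-token mixing weights per expert, each expert processes the weighted average ("mixed token") of its group, and the output is redistributed to the tokens with the same weights, so the whole layer stays differentiable end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfTokens(nn.Module):
    """Sketch of a Mixture-of-Tokens feed-forward layer (illustrative only).

    Tokens at the same position across the batch form a group. For each
    expert, a controller yields softmax mixing weights over the group; the
    expert processes the resulting weighted average, and its output is
    redistributed to the tokens with those same weights.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.controller = nn.Linear(d_model, n_experts)
        # One FFN per expert, stored as batched parameters for brevity.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * d_model**-0.5)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * d_ff**-0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        # Group tokens across the batch: (seq, batch, d_model).
        groups = x.transpose(0, 1)
        # Continuous mixing weights: softmax over the group (batch) dimension.
        weights = self.controller(groups).softmax(dim=1)          # (s, b, e)
        # Mixed token per (position, expert): weighted average over the batch.
        mixed = torch.einsum("sbe,sbd->sed", weights, groups)     # (s, e, d)
        # Each expert processes its own mixed tokens.
        hidden = F.relu(torch.einsum("sed,edf->sef", mixed, self.w_in))
        out = torch.einsum("sef,efd->sed", hidden, self.w_out)
        # Redistribute expert outputs back to tokens with the same weights.
        y = torch.einsum("sbe,sed->sbd", weights, out)
        return y.transpose(0, 1)                                  # (batch, seq, d)
```

Because mixing and redistribution are plain weighted sums, gradients flow through every token-to-expert assignment, which is what distinguishes this from discrete top-k routing.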
Open source status
Provide useful links for the implementation
https://github.com/llm-random/llm-random
https://github.com/sebastianjaszczur
https://llm-random.github.io/posts/mixture_of_tokens/