kyegomez / Mixture-of-Depths

Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
MIT License

[BUG] Lack of weight multiplication #1

Closed hypnopump closed 3 months ago

hypnopump commented 5 months ago

The implementation does not multiply the block output by the router weight, as is done in the figure from the original paper (see below).

[Image: Screenshot 2024-04-06 at 11.44.35]
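
For reference, here is a minimal sketch of what the missing multiplication could look like. It assumes the update rule shown in the paper's figure: routed tokens get `x_i + r_i * f(x)_i`, where `r_i` is the router weight, and all other tokens pass through unchanged. The names `mod_block`, `router_w`, `block_fn`, and `capacity` are hypothetical and are not taken from this repository. Plain Python lists are used instead of tensors to keep the sketch self-contained.

```python
def mod_block(tokens, router_w, block_fn, capacity):
    """Sketch of one Mixture-of-Depths block with router-weight scaling.

    tokens:   list of token vectors (lists of floats)
    router_w: router weight vector (hypothetical linear router, no bias)
    block_fn: stands in for the attention + MLP block; it takes the list of
              routed tokens and returns one update vector per routed token
    capacity: number of tokens allowed through the block
    """
    # One scalar router weight per token: r_i = <router_w, x_i>.
    scores = [sum(w * x for w, x in zip(router_w, tok)) for tok in tokens]

    # Keep the `capacity` highest-scoring tokens, preserving sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:capacity])

    # Run the block only on the routed tokens.
    updates = block_fn([tokens[i] for i in selected])

    # Unrouted tokens skip the block through the residual path.
    out = [list(tok) for tok in tokens]
    for i, update in zip(selected, updates):
        # The step this issue is about: scale the block output by r_i
        # before adding it back to the residual stream.
        out[i] = [x + scores[i] * u for x, u in zip(tokens[i], update)]
    return out
```

Without the multiplication by `scores[i]`, the router weights have no effect on the output, so under this reading no gradient from the language-modeling loss would reach the router.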


github-actions[bot] commented 5 months ago

Hello there, thank you for opening an issue! 🙏🏻 The team was notified and they will get back to you ASAP.

github-actions[bot] commented 3 months ago

Stale issue message