microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

MuP for Mamba #74

Open norikazu99 opened 1 month ago

norikazu99 commented 1 month ago

Hello, and thank you for sharing the repo. I'd like to know whether muP works out of the box with the Mamba model, or whether I would have to rescale some constants, as is done for attention_scores in transformers.
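For context, the transformer rescaling mentioned above is the switch from 1/sqrt(d) to 1/d attention scaling that muP prescribes. Below is a minimal pure-Python sketch of that one change, not code from the mup package: the `mup` flag and the toy vectors are hypothetical and only for illustration. Whether Mamba needs an analogous change is exactly the open question here.

```python
import math


def attention_scores(query, key, mup=True):
    """Scaled dot-product score between one query and one key vector.

    mup=False: standard 1/sqrt(d) scaling.
    mup=True:  1/d scaling, as prescribed by muP for attention logits.
    The `mup` flag is a hypothetical name used only in this sketch.
    """
    d = len(query)
    dot = sum(q * k for q, k in zip(query, key))
    scale = 1.0 / d if mup else 1.0 / math.sqrt(d)
    return dot * scale


if __name__ == "__main__":
    # Toy case: perfectly correlated query and key, so the dot product
    # grows linearly with the width d. Compare how each scaling behaves
    # as d increases.
    for d in (64, 256, 1024):
        q = [1.0] * d
        k = [1.0] * d
        standard = attention_scores(q, k, mup=False)
        mup_scaled = attention_scores(q, k, mup=True)
        print(d, standard, mup_scaled)
```

In this toy case the 1/sqrt(d) score keeps growing with width while the 1/d score stays fixed, which is the motivation for the rescaling in transformers.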