microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License
1.24k stars, 88 forks

Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way? #53

Closed (ricomnl closed 1 year ago)