microsoft/mup
maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License
Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way?
#53 · Closed · ricomnl · closed 1 year ago