microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Hyperparameter search on base models #12

Closed: davisyoshida closed this issue 2 years ago

davisyoshida commented 2 years ago

Following up on the conversation here https://github.com/microsoft/mup/issues/11 since it wasn't related to the original issue.

How exactly should the learning rates be split up when doing a hyperparameter search on the base model? You said input/hidden/output, but Table 8 groups all biases with the input weights. Does the output bias also fall into the input/bias group? (It has finite fan-in and fan-out, unlike the other biases, which have infinite fan-out.)

edwardjhu commented 2 years ago

Hi Davis,

Scaling-wise, all biases, including that of the output layer, should be scaled in the same way as the input layer if we are implementing muP in the style of Table 8. That is because the last layer is calculated as f(x) = Wx / fan_in + b, where the division by fan_in corresponds to the parameter multiplier. To make f(x) change by \Theta(1), the change to b needs to be \Theta(1) as well.
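A minimal sketch of what this looks like as a module (this is not the mup package's own readout layer; the class name and the init scale below are placeholders):

```python
import torch
import torch.nn as nn

class ReadoutSketch(nn.Module):
    """Table-8-style output layer sketch: the 1/fan_in multiplier
    applies only to the weight term, never to the bias."""
    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        self.fan_in = fan_in
        # Init scale here is a placeholder; see Table 8 / the mup package
        # for the actual prescription.
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in) / fan_in**0.5)
        self.bias = nn.Parameter(torch.zeros(fan_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # f(x) = Wx / fan_in + b: the bias is not divided by fan_in, so a
        # Theta(1) change to b produces a Theta(1) change in f(x).
        return x @ self.weight.t() / self.fan_in + self.bias
```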

In terms of HP search, it's probably sufficient to just sweep the input/output/hidden split as Greg suggested, since the learning rate for biases doesn't matter too much unless it's so large that it destabilizes the model.
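For the sweep itself, something like the following parameter grouping works; the name-matching rules ("embed", "readout") are just an assumption about how the base model names its modules, not anything mup requires:

```python
import itertools
import torch
import torch.nn as nn

def make_param_groups(model: nn.Module, lr_input: float, lr_hidden: float, lr_output: float):
    # One learning rate per group; all biases are folded into the input group,
    # matching the Table 8 grouping discussed above.
    input_like, hidden, output_like = [], [], []
    for name, p in model.named_parameters():
        if name.endswith("bias") or "embed" in name:
            input_like.append(p)
        elif "readout" in name:
            output_like.append(p)
        else:
            hidden.append(p)
    return [
        {"params": input_like, "lr": lr_input},
        {"params": hidden, "lr": lr_hidden},
        {"params": output_like, "lr": lr_output},
    ]

# Sweep the three learning rates on the small base model (base_model is
# assumed to be defined elsewhere):
# for lr_in, lr_hid, lr_out in itertools.product([1e-3, 3e-3, 1e-2], repeat=3):
#     opt = torch.optim.AdamW(make_param_groups(base_model, lr_in, lr_hid, lr_out))
```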

davisyoshida commented 2 years ago

> the last layer is calculated as f(x) = Wx / fan_in + b

Ah, I had mistakenly implemented (Wx + b) / fan_in, so I was confused. Thanks for the pointers.
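For anyone else hitting this, a tiny illustration of the difference (shapes arbitrary):

```python
import torch

fan_in, fan_out = 1024, 10
x = torch.randn(fan_in)
W = torch.randn(fan_out, fan_in)
b = torch.randn(fan_out)

wrong = (x @ W.t() + b) / fan_in   # bias effect shrinks as width grows
right = x @ W.t() / fan_in + b     # only the weight term carries the 1/fan_in multiplier
```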