Closed davisyoshida closed 2 years ago
Hi Davis,
Scaling-wise, all biases, including that of the output layer, should be scaled in the same way as the input layer if we are implementing muP in the style of Table 8. That is because the last layer is calculated as `f(x) = Wx / fan_in + b`, where the division by `fan_in` corresponds to the parameter multiplier. To make `f(x)` change by Θ(1), the change to `b` needs to be Θ(1) as well.
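A minimal NumPy sketch of this readout (an illustration only, not the mup library's actual `MuReadout` implementation): because `b` sits outside the `1/fan_in` multiplier, its contribution to the output is independent of width.

```python
import numpy as np

def mup_readout(x, W, b):
    """Sketch of a Table 8-style muP readout layer.

    The 1/fan_in factor is a parameter multiplier applied to the weights
    only; the bias is added outside it, so a Theta(1) change to b produces
    a Theta(1) change in f(x) regardless of width.
    """
    fan_in = W.shape[1]
    return x @ W.T / fan_in + b
```

With `W = 0`, the output equals `b` exactly, at any width, which is the Θ(1) behavior described above.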
In terms of HP search, it's probably sufficient to sweep just the input/output/hidden split, as Greg suggested, since the learning rate for biases doesn't matter too much unless it's so large that it destabilizes the model.
> the last layer is calculated as f(x) = Wx / fan_in + b
Ah, I had mistakenly implemented `(Wx + b) / fan_in`, so I was confused. Thanks for the pointers.
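For anyone hitting the same confusion, here is a hedged sketch contrasting the two forms (hypothetical function names, plain NumPy rather than the mup library): in the mistaken version the bias contribution shrinks as `1/fan_in`, so bias updates vanish as width grows.

```python
import numpy as np

def correct_readout(x, W, b):
    # Wx / fan_in + b: the bias sits outside the parameter multiplier,
    # so its effect on the output is Theta(1) in the width.
    return x @ W.T / W.shape[1] + b

def mistaken_readout(x, W, b):
    # (Wx + b) / fan_in: the bias is incorrectly scaled down by fan_in,
    # so its effect on the output decays as the model gets wider.
    return (x @ W.T + b) / W.shape[1]
```

Setting `W = 0` isolates the bias term: the correct form returns `b` at every width, while the mistaken form returns `b / fan_in`.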
Following up on the conversation here https://github.com/microsoft/mup/issues/11 since it wasn't related to the original issue.
How exactly should the learning rates be split up when doing hyperparameter search on the base model? You said input/hidden/output, but Table 8 groups all biases with input weights. Does the output bias also fall into the input/bias group? (It has finite fan-in and fan-out, unlike the other biases, which have infinite fan-out.)