microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

LayerNorm Gain and Bias Multipliers #28

Closed AWildridge closed 1 year ago

AWildridge commented 1 year ago

Hi, I'm wondering how to correctly implement the LayerNorm gain and bias multipliers.

In section F.3 it is detailed that a hyperparameter search on LayerNorm gain multiplier and bias multiplier is done in addition to the normal output multiplier and attention logits multiplier. However, I cannot find any example of this being done in the examples/Transformers/models.py module. I also checked the mutransformers repo and checked the BERT example and could also find no evidence of how to correctly implement a layer norm gain multiplier and bias multiplier. Sorry if I somehow managed to miss this.

In the 3rd paragraph of Appendix A you note that any parameter tensor in a neural network can be multiplied by a constant c, where c is defined as a parameter multiplier. Therefore, I think it is reasonable to deduce that the correct implementation of a LayerNorm gain multiplier is simply layernorm_gain_multiplier * torch.nn.LayerNorm(x). Is this correct?

For the bias multiplier it is not so clear to me how this is correctly implemented. Of course one could write it as torch.nn.Linear(x, bias=False) + bias * bias_multiplier, but if this is the correct implementation, several things are unclear to me:

1. Why is the bias multiplier applied only to the bias and not to the entire linear transformation?
2. Should this same bias multiplier be applied to all linear transformations? Table 8 seems to suggest there is nothing stopping you from doing this.
3. Should this bias multiplier be applied to output weights that already have an output multiplier? Should the bias multiplier in this case effectively be bias_multiplier * output_multiplier?
4. Should this bias multiplier be applied to other operations that have a bias-like term, for example batch and layer normalization?

Thank you for this wonderful paper and excellent repo detailing the correct implementation and cross-checking via coord checking! -AJ

thegregyang commented 1 year ago

Hi AJ,

Thanks for your interest!

If you look at Lemma J.1 of the paper, you'll see that for every parameter tensor, the three hyperparameters [learning rate, initialization, and multiplier] have a 1-dimensional redundancy. So the choice we made in these repos, in general, is to absorb the multipliers into the pair [learning rate, initialization]. This is because, from a user interface perspective, practitioners tend to be more comfortable with modifying the learning rate and initialization of a fixed model than with modifying the model itself by inserting multipliers.

Now, in these examples, we did not explicitly implement separate learning rate and initialization hyperparameters for gain and bias parameters either (other than a more global learning rate and initialization). This is just because we want the code to be as simple as possible as a readable example, as a way of illustrating the general principle at play.

In any case, it is easy to insert learning rate and initialization hyperparameters via the usual ways: parameter groups in optimizers for learning rates, and the module.apply function for initialization (you can look at the mutransformers repo for examples of the latter). If you aren't familiar with these, feel free to follow up with questions.
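The two mechanisms mentioned above can be sketched as follows. This is an illustrative example using plain torch.optim.SGD, not the mup package's own optimizer wrappers; the grouping rule (1-D tensors as gains/biases vs. 2-D weight matrices) and the learning-rate values are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
from torch.optim import SGD

# A toy model standing in for a real network.
model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 4))

# Separate learning rates via optimizer parameter groups:
# here, 1-D tensors (gains/biases) get a different lr than weight matrices.
vector_like = [p for p in model.parameters() if p.ndim <= 1]
matrix_like = [p for p in model.parameters() if p.ndim > 1]
optimizer = SGD([
    {"params": matrix_like, "lr": 1e-2},
    {"params": vector_like, "lr": 1e-1},  # e.g. a larger lr for gains/biases
])

# Separate initialization via module.apply.
def init_fn(module):
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_fn)
```

In actual muP code one would combine this kind of grouping with the package's machinery (e.g. its optimizer wrappers) so the width scaling is handled for you.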

Therefore, I think it is reasonable to deduce that the correct implementation of a LayerNorm gain multiplier is simply: layernorm_gain_multiplier * torch.nn.LayerNorm(x) Is this correct?

Because LayerNorm applies both gain and bias, this multiplier would be applied to both simultaneously. In principle, for accurate hyperparameter transfer, you should have separate multipliers for gain and for bias, i.e. torch.nn.LayerNorm(x, elementwise_affine=False) * gain_mult * learnable_gain + bias_mult * learnable_bias, but probably tying them together isn't much worse.
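Written out as a module, the separate-multiplier variant might look like the following sketch. The class and attribute names (MultiplierLayerNorm, gain_mult, bias_mult) are illustrative and not part of the mup package; with the default init (gain = 1, bias = 0), this reduces to gain_mult times the plain layer norm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplierLayerNorm(nn.Module):
    """LayerNorm with separate gain and bias multipliers (illustrative sketch)."""

    def __init__(self, dim, gain_mult=1.0, bias_mult=1.0):
        super().__init__()
        self.dim = dim
        self.gain_mult = gain_mult
        self.bias_mult = bias_mult
        self.gain = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Normalize without the built-in affine transform, then apply
        # separately multiplied learnable gain and bias.
        x = F.layer_norm(x, (self.dim,))
        return x * (self.gain_mult * self.gain) + self.bias_mult * self.bias
```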

Why is the bias multiplier only being applied to the bias and not the entire linear transformation?

You can have an additional weight multiplier:

torch.nn.Linear(x, bias=False) * weight_mult + bias * bias_mult

If you tie weight_mult and bias_mult together, you’ll have what you are suggesting.
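As a concrete sketch of the untied case (names are illustrative, not from the mup package), one can wrap a bias-free Linear and add the multiplied bias by hand; setting weight_mult == bias_mult then recovers a single multiplier on the whole affine map.

```python
import torch
import torch.nn as nn

class MultiplierLinear(nn.Module):
    """Linear layer with separate weight and bias multipliers (illustrative sketch)."""

    def __init__(self, in_dim, out_dim, weight_mult=1.0, bias_mult=1.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.weight_mult = weight_mult
        self.bias_mult = bias_mult

    def forward(self, x):
        # weight_mult scales the linear map; bias_mult scales only the bias.
        return self.weight_mult * self.linear(x) + self.bias_mult * self.bias
```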

One way to think about this and all of your related questions is as follows:

The true hyperparameter space here is the very high-dimensional space containing [learning rate, initialization] (we could insert multipliers here as well, but as I said, that is redundant) for every parameter tensor (weights, biases, gains, etc.). If you were to tune all of these hyperparameters and obtain the optimal combination, then that combination is guaranteed to be stable in some sense as you vary width (in muP).

However, in practice we may not want to tune that many hyperparameters because of resource constraints. So we combine hyperparameters (e.g., by tying the learning rates of many weights together) until we have only a small number to tune. This essentially means we are now focusing on a low-dimensional slice of the true hyperparameter space, one that we guess should contain all the really good hyperparameters. The choices of hyperparameters we tuned in our paper exemplify the "low-dimensional slice" we chose. These choices are based on our empirical experience tuning hyperparameters, but over time people may find better choices.

Should this same bias multiplier be applied to all linear transformations? From Table 8 it seems to suggest there's nothing stopping you from doing this.

Following the explanation above, if you believe the optimal hyperparameter combination will have all the bias multipliers tied, then this is a reasonable guess. In general, in my experience, a good starting point is to separate the hyperparameters of the hidden layers from those of the input and output layers.
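That input/hidden/output separation is most easily expressed as optimizer parameter groups. This is a hedged sketch, not mup's API: the layer choices and learning-rate placeholders are hypothetical, and in real muP code the per-group values would additionally follow the paper's width-scaling rules.

```python
import torch.nn as nn
from torch.optim import Adam

# Toy layers standing in for the three roles (illustrative only).
embed = nn.Embedding(1000, 64)   # "input" layer
hidden = nn.Linear(64, 64)       # "hidden" layer
readout = nn.Linear(64, 1000)    # "output" layer

# One tunable learning rate per role; values are placeholders to tune.
optimizer = Adam([
    {"params": embed.parameters(), "lr": 1e-3},
    {"params": hidden.parameters(), "lr": 3e-4},
    {"params": readout.parameters(), "lr": 1e-4},
])
```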

Should this bias multiplier be applied to output weights that have an output multiplier being applied to it already? Should the bias multiplier in this case effectively be bias_multiplier * output_multiplier ?

As above, I would separate the output layer from other layers. You can implement it like you suggested.

Should this bias multiplier be applied to other terms that have a bias-like term? For example, batch and layer normalization.

You can. I don't have a strong preference here. If you have more resources for tuning, you can separate out the biases of BN and LN from the layer biases.

thegregyang commented 1 year ago

Seems like this is solved for now. Feel free to re-open if you have further problems.