microsoft / mup

maximal update parametrization (µP)
https://arxiv.org/abs/2203.03466
MIT License

Should query layers in self-attention be initialized to 0 in practice? #36

Closed · wang-zerui closed this issue 1 year ago

wang-zerui commented 1 year ago

Hello, in your paper, Section D.2 says that the weights of the query layers should be initialized to 0. But in your code, it looks like this:

https://github.com/microsoft/mup/blob/cf453c44e7f5da5f7dccdda43605048998d6cc95/examples/Transformer/model.py#L291

Here it is actually the output layer's weight of this MLP that is initialized to 0, not the query layer's.

Why is it like this? Did I miss some detail in the paper?
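For concreteness, here is a minimal sketch of what I understand Section D.2 to prescribe, namely zero-initializing the query projection so the attention logits start at zero. This is my own illustration; the class and attribute names are assumptions, not code from this repo.

```python
import torch.nn as nn

class SelfAttentionProjections(nn.Module):
    """Illustrative q/k/v projections, not mup's example Transformer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Section D.2: zero-initialize the query weights, so every
        # attention score is 0 at init and attention starts out uniform.
        nn.init.zeros_(self.q_proj.weight)
```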

thegregyang commented 1 year ago

Hi Zerui,

You are right: technically, this line should use zero initialization. This is now fixed.

For context, in our experience, zero-initializing the last layer is empirically more important than zero-initializing the query head, so it doesn't make much of a difference whether the query layers use zero initialization. That is definitely the case for the transformer model in this code, so we didn't change this part of the model (preferring to err on the side of fewer changes rather than more, so it's easier for people to adopt). But it's possible that things are different in your own models.
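To make the empirically important part concrete, here is a rough sketch of zero-initializing the readout layer. The sizes are hypothetical and this is not the example Transformer's actual code:

```python
import torch.nn as nn
from mup import MuReadout

d_model, vocab_size = 512, 32000  # hypothetical sizes

# MuReadout is mup's drop-in replacement for the final nn.Linear.
# (In a real model you would also call mup.set_base_shapes before training.)
readout = MuReadout(d_model, vocab_size)
nn.init.zeros_(readout.weight)  # zero-initialize the last layer's weight
```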

wang-zerui commented 1 year ago

Ok, I understand. Thank you for your quick reply!