Closed wang-zerui closed 1 year ago
Hi Zerui,
You are right, technically this line should be zero initialization. This is now fixed.
For context, in our experience the zero initialization of the last layer is empirically more important than that of the query head, so it doesn't make as much of a difference whether the query layer has zero initialization. This is definitely the case for the transformer model in the code, so we didn't change this part of the model (preferring to err on the side of fewer changes rather than more, so it's easier for people to adopt). But it's possible that it might be different in your own models.
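To make the distinction concrete, here is a minimal sketch (an assumption for illustration, not the repo's actual code) of which weights get zero initialization under the scheme discussed above: the query projection and the output (readout) layer are zeroed, while hidden weights keep a standard fan-in-scaled Gaussian init. The function name `mup_style_init` and the flat weight-matrix layout are hypothetical simplifications.

```python
import numpy as np

def mup_style_init(d_model, d_vocab, rng=None):
    """Hypothetical sketch of the init discussed in the thread.

    - hidden weights: Gaussian scaled by 1/sqrt(fan_in)
    - query projection: zeros (per Appendix D.2 of the paper)
    - output/readout layer: zeros (empirically the more important one)
    """
    rng = rng or np.random.default_rng(0)
    hidden = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(d_model, d_model))
    query = np.zeros((d_model, d_model))    # query head zero-initialized
    readout = np.zeros((d_vocab, d_model))  # output layer zero-initialized
    return hidden, query, readout
```

With zero-initialized queries, all attention logits start at zero (uniform attention); with a zero-initialized readout, the model's initial output is exactly zero regardless of the hidden activations, which is the property the reply above calls empirically more important.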
Ok, I understand. Thank you for your quick reply!
Hello, in your paper, section D.2 says that the weights of the query layers should be initialized to 0. But in your code, it is like this:
https://github.com/microsoft/mup/blob/cf453c44e7f5da5f7dccdda43605048998d6cc95/examples/Transformer/model.py#L291
In this code, it is actually the output layer's weight of the MLP that is initialized to 0.
Why is it like this? Did I miss some detail in the paper?