layer6ai-labs / T-Fixup

Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
MIT License

Details for initializing FFN (MLP blocks)? #5

Closed (zhuchen03 closed this issue 1 year ago)

zhuchen03 commented 3 years ago

Hi, thanks for sharing the code.

I am wondering why the norms of the two FC layers in the MLP blocks are scaled to (9N)^{-1/4}. I feel it should be something like (9N)^{-1/2} if an analysis similar to Theorem 3.1 is applied to each of the MLP layers individually.
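For context, here is a minimal sketch of the kind of scaling being asked about, assuming (as described in the paper) that the scale is applied multiplicatively on top of Xavier initialization; the function name and layer sizes are illustrative, not the repo's exact code:

```python
import torch
import torch.nn as nn

def tfixup_scale_ffn(linear: nn.Linear, num_layers: int) -> None:
    """Rescale an FFN weight matrix by (9 * num_layers) ** (-1/4) after Xavier init."""
    with torch.no_grad():
        linear.weight.mul_((9 * num_layers) ** (-0.25))

# Illustrative FFN (MLP) block of a decoder layer with N = 6 layers.
ffn_in = nn.Linear(512, 2048)
ffn_out = nn.Linear(2048, 512)
for layer in (ffn_in, ffn_out):
    nn.init.xavier_uniform_(layer.weight)
    tfixup_scale_ffn(layer, num_layers=6)
```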

risingdhxs commented 3 years ago

I think the -1/4 in the exponent is correct. Each FC layer's weight norm (say |w1| and |w2|) gets squared in the calculation of Theorem 3.1, which turns (9N)^{-1/4} into (9N)^{-1/2} per layer. The product |w1|^2 * |w2|^2 then gives the desired (9N)^{-1}.
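A quick numerical check of that exponent argument (illustrative only, with an assumed N = 6):

```python
N = 6                        # assumed number of layers
scale = (9 * N) ** -0.25     # per-layer weight-norm scale, (9N)^{-1/4}
w1_sq = scale ** 2           # |w1|^2 ~ (9N)^{-1/2}
w2_sq = scale ** 2           # |w2|^2 ~ (9N)^{-1/2}
print(w1_sq * w2_sq, 1 / (9 * N))  # both ~0.01851..., i.e. (9N)^{-1}
```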