Closed XintianHan closed 1 year ago
RetNet uses DeepNet's derivation methods to obtain the initialization for better training stability, instead of directly re-using its derived initialization (on Post-LN transformers), because the initialization depends on the model architecture according to the theory in DeepNet.
Thanks for the quick reply!
"because the initialization depends on the model architecture according to the theory in DeepNet"
Could you elaborate on the derivation methods? How did you get the number 2 ** -2.5 here? Thanks!
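For context, here is a minimal stdlib-only sketch of what a gain-scaled Xavier (Glorot) normal initialization looks like with the 2 ** -2.5 value plugged in as the gain. Which layers this gain is applied to in the actual torchscale/RetNet code is an assumption on my part — this just shows the arithmetic of the scale being asked about.

```python
import math
import random

def xavier_normal(fan_in, fan_out, gain=1.0, seed=0):
    # Xavier/Glorot normal init: std = gain * sqrt(2 / (fan_in + fan_out))
    rng = random.Random(seed)
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# The gain asked about in this thread: 2 ** -2.5 ~= 0.1768.
# (Hypothetical usage — which projections get this gain is not confirmed here.)
gain = 2 ** -2.5
W = xavier_normal(256, 256, gain=gain)
```

With fan_in = fan_out = 256, the resulting weights have standard deviation gain * sqrt(2/512), i.e. the fixed gain simply shrinks the usual Xavier scale by a factor of about 5.7x.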
I am also interested in this initialisation scheme. It seems that recurrent models such as S4 and S5 use different schemes. Do you have any particular explanation or heuristic for this scale?
In the paper, the authors mentioned that the initialization followed DeepNet, but from the code it looks somewhat different. Why is there a mismatch?