Open yyht opened 4 years ago

Hi, nice work. When I apply it to a shallower BERT or GPT, I often get NaN gradients after initialization (even for deeper architectures).
@yyht A few questions:
def create_initializer(initializer_range=0.02):
  """Creates a `truncated_normal_initializer` with the given range."""
  return tf.truncated_normal_initializer(stddev=initializer_range)

Try initializing the embedding matrix to a uniform distribution drawn from ±1/d.
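For concreteness, a minimal sketch of what that could look like next to the snippet above (the helper name here is made up for illustration; only the embedding table would use it, other weights can keep the truncated-normal init):

```python
import tensorflow as tf

def create_embedding_initializer(hidden_size):
  """Illustrative alternative to create_initializer, for the embedding
  table only: draw weights uniformly from [-1/d, 1/d] instead of a
  truncated normal with stddev 0.02."""
  return tf.random_uniform_initializer(
      minval=-1.0 / hidden_size, maxval=1.0 / hidden_size)
```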
@calclavia can you give a little more insight into the reasoning behind this embedding init recommendation? Curious whether it's motivated by empirical performance or by a theoretical justification.
@sooheon It depends on the particular implementation of your Transformer. Some implementations (e.g. Huggingface) initialize the embedding with a uniform distribution (-1 to +1) and scale the embedding by 1/d before passing it into the higher layers. This effectively does the same thing as initializing it to ±1/d.
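A quick numpy check of that equivalence (illustrative only, not code from any of the implementations mentioned):

```python
import numpy as np

d = 512
rng = np.random.default_rng(0)

# Option A: initialize Uniform(-1, 1), then scale the embedding output by 1/d.
a = rng.uniform(-1.0, 1.0, size=100_000) / d
# Option B: initialize directly as Uniform(-1/d, 1/d).
b = rng.uniform(-1.0 / d, 1.0 / d, size=100_000)

# Same distribution either way; e.g. both have std ~= (1/d) / sqrt(3).
print(a.std(), b.std(), 1.0 / (d * np.sqrt(3)))
```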
The reasoning for this initialization has less to do with our paper - we simply follow what previous work has recommended. I believe the Attention Is All You Need paper recommended this kind of scaling for the attention softmax (dividing the logits by √d_k when d is large); the scaling keeps the gradients of the softmax layer well behaved.
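A rough illustration of that gradient argument (hand-picked numbers, not from the paper): dot products of d-dimensional unit-variance vectors have magnitudes on the order of √d, and logits that large saturate the softmax, which shrinks its Jacobian:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def jacobian_mass(z):
    # Sum of |d softmax_i / d z_j| over all i, j, which equals 2 * (1 - sum_i p_i^2).
    p = softmax(z)
    return 2.0 * (1.0 - np.sum(p ** 2))

d = 512
logits = np.array([2.0, 1.0, 0.5, -1.0])    # "unit-scale" logits
print(jacobian_mass(logits))                 # ~1.1: gradients flow
print(jacobian_mass(logits * np.sqrt(d)))    # ~0: saturated softmax, tiny gradients
```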
The same principle applies to the output softmax when predicting the output vocabulary. Because ReZero initializes the Transformer layers to zero, the model essentially starts off as a pass-through from the input embedding directly to the output embedding, and the 1/d initialization keeps the gradients there well behaved.
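To make the pass-through point concrete, here is a minimal ReZero-style residual block in PyTorch (a sketch of the idea, not this repo's code): with the per-layer scalar α initialized to zero, every block is exactly the identity at initialization, so gradients initially flow straight from the output softmax back to the embedding, and the embedding init matters a lot.

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block gated by a learnable scalar alpha, initialized to 0."""

    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.alpha = nn.Parameter(torch.zeros(1))  # ReZero: start each layer "off"

    def forward(self, x):
        # At initialization alpha == 0, so the block is an exact pass-through.
        return x + self.alpha * self.ff(x)

x = torch.randn(8, 16, 512)
block = ReZeroBlock(512)
assert torch.equal(block(x), x)  # identity at init
```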