Open xuwenshen opened 5 years ago
Normalization seems different from the paper "Attention Is All You Need"
In the paper, the layer normalization comes after the multi-head attention and feed-forward sublayers; in torchnlp, it comes before them:
x = inputs
# Layer Normalization
x_norm = self.layer_norm_mha(x)
# Multi-head attention
y = self.multi_head_attention(x_norm, x_norm, x_norm)
# Dropout and residual
x = self.dropout(x + y)
# Layer Normalization
x_norm = self.layer_norm_ffn(x)
# Positionwise Feedforward
y = self.positionwise_feed_forward(x_norm)
# Dropout and residual
y = self.dropout(x + y)
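For reference, the post-norm ordering described in the paper, LayerNorm(x + Sublayer(x)), would look like the sketch below. This is a minimal illustration built from standard torch.nn modules (nn.MultiheadAttention, nn.LayerNorm), not from torchnlp's own classes; the class name and hyperparameters are assumptions for the example.

import torch
import torch.nn as nn

class PostNormEncoderLayer(nn.Module):
    """Post-norm encoder layer as in the original paper: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm_mha = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention on the raw input, then dropout + residual, then LayerNorm (post-norm)
        y, _ = self.mha(x, x, x)
        x = self.norm_mha(x + self.dropout(y))
        # Positionwise feed-forward, then dropout + residual, then LayerNorm
        y = self.ffn(x)
        x = self.norm_ffn(x + self.dropout(y))
        return x

The torchnlp snippet above instead normalizes the input before each sublayer and adds the residual afterwards (pre-norm), which is the ordering the reply below refers to.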
Yes, it's from the updated Transformer model. You can find the TensorFlow version maintained by the authors here