a learned absolute position embedding is added to the word embeddings to give the model a sense of which part of the input or output sequence it is currently dealing with
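A minimal sketch of this input construction (the vocabulary/position sizes and variable names here are illustrative, not from the paper):

    import torch
    import torch.nn as nn

    vocab_size, max_positions, d = 10000, 1024, 512      # illustrative sizes
    word_emb = nn.Embedding(vocab_size, d)
    pos_emb = nn.Embedding(max_positions, d)              # learned absolute positions

    tokens = torch.randint(0, vocab_size, (2, 20))             # (batch, seq_len)
    positions = torch.arange(tokens.size(1)).unsqueeze(0)      # (1, seq_len), broadcast over batch
    x = word_emb(tokens) + pos_emb(positions)                  # (batch, seq_len, d) input to the network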
Convolutional Block Structure
each convolution kernel takes k input elements, each embedded in d dimensions, and maps them to a single output element of dimension 2d (a sketch of the full block follows at the end of this section)
a Gated Linear Unit (GLU) serves as the gating mechanism: point-wise multiplication of one half of the 2d-dimensional output with a sigmoid gate computed from the other half, producing a d-dimensional output
residual connections are added to enable a deep architecture
in the encoder, padding is added so that the output length matches the input length
in the decoder, masking is applied so that convolutions cannot see future information
linear mappings project between the embedding size f and the convolution outputs of size 2d; such projections are applied when feeding embeddings to the encoder, to the encoder output, to the final decoder layer just before the softmax, and to all decoder layers before computing attention scores
softmax layer
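A minimal sketch of one encoder convolution block as described above (class and variable names are my own, not the paper's; assumes an odd kernel width so same-length padding works out):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGLUBlock(nn.Module):
        def __init__(self, d, k):
            super().__init__()
            # kernel of width k over d-dim inputs, 2d output channels;
            # encoder-side padding keeps output length equal to input length (odd k assumed)
            self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=(k - 1) // 2)

        def forward(self, x):                     # x: (batch, seq_len, d)
            h = self.conv(x.transpose(1, 2))      # (batch, 2d, seq_len)
            h = F.glu(h, dim=1)                   # A * sigmoid(B): gates back down to d channels
            return x + h.transpose(1, 2)          # residual connection

    # a decoder block would instead pad/mask so that no future positions are visible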
Multi-step Attention
a separate attention mechanism is introduced for each decoder layer (multi-hop attention)
for decoder layer l, attention scores are computed as dot products between the decoder state summary d_l and each output z_j of the last encoder block
the input element embedding e_j is added to the weighted sum, i.e. c_i = sum_j a_ij (z_j + e_j), because e_j provides point information about a specific input element that is useful when making a prediction (sketched below)
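A minimal sketch of this per-layer attention (tensor names d_l, z, e follow the notation above; the shapes are my assumption):

    import torch
    import torch.nn.functional as F

    def layer_attention(d_l, z, e):
        # d_l: (batch, tgt_len, d) decoder state summary for layer l
        # z:   (batch, src_len, d) outputs of the last encoder block
        # e:   (batch, src_len, d) input element embeddings
        scores = torch.bmm(d_l, z.transpose(1, 2))   # dot products, (batch, tgt_len, src_len)
        a = F.softmax(scores, dim=-1)                # attention weights over source positions
        c = torch.bmm(a, z + e)                      # weighted sum over z_j + e_j -> conditional input
        return c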
Normalization Strategy
careful scaling of parts of the network is used so that the variance of activations does not change dramatically as signals propagate through the network
the sum of the input and output of a residual block is multiplied by sqrt(0.5) to halve the variance of the sum
the conditional input c generated by attention is a weighted sum of m vectors, so it is scaled by m * sqrt(1/m) to counteract the change in variance (see the sketch after this list)
for decoders with multiple attention mechanisms, the gradients of the encoder layers are scaled by the number of attention mechanisms in use
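A sketch of the two scaling rules above (m is the number of source elements; the function names are mine):

    import math

    def scale_residual(block_input, block_output):
        # sum of a residual block's input and output, scaled to halve the variance of the sum
        return (block_input + block_output) * math.sqrt(0.5)

    def scale_conditional_input(c, m):
        # c is a weighted sum of m attention vectors; multiply by m * sqrt(1/m)
        # to counteract the variance change of the (roughly uniform) weighted sum
        return c * (m * math.sqrt(1.0 / m))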
Initialization
careful weight initialization is used to keep the variance of activations stable in the forward and backward passes
embeddings initialized with N(0, 0.1)
layers whose output is not fed directly into a GLU are initialized from N(0, sqrt(1/n_l)), where n_l is the number of input connections to each neuron
layers feeding into a GLU are initialized from N(0, sqrt(4/n_l)), so that the input to the GLU activation has 4 times the variance of the layer input (the GLU output has roughly a quarter of its input variance)
when dropout is applied with retain probability p, the initialization variance is additionally scaled by p (a sketch follows below)
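A sketch of these initialization rules (the helper name and the feeds_glu / dropout_p arguments are my own; n_in plays the role of n_l):

    import math
    import torch.nn as nn

    def init_weights(module, feeds_glu=False, dropout_p=1.0):
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.1)
        elif isinstance(module, (nn.Linear, nn.Conv1d)):
            n_in = module.weight[0].numel()            # input connections per output unit (n_l)
            variance = (4.0 if feeds_glu else 1.0) / n_in
            variance *= dropout_p                      # scale by retain probability when dropout is used
            nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(variance))
            if module.bias is not None:
                nn.init.zeros_(module.bias)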
Results
SoTA on WMT'16 En-Ro, WMT'14 En-De, and WMT'14 En-Fr
Training
WMT'14 En-Fr
15 encoder/decoder layers (512 x 5, 768 x 4, 1024 x 3, 2048 x 2, 4096 x 1 hidden units)
effective context size of 25 tokens
8 GPUs, batch size 32, 37 days of training
Inference Speed
roughly 9x to 20x faster inference than GNMT on GPU
Ablation Studies
Positional Embeddings : removing position embeddings only slightly hurts accuracy (lower BLEU, higher PPL)
Multi-step attention improves performance with very small overhead (3674 words/sec -> 3772 words/sec)
Encoder/Decoder Layers : a deeper encoder helps; a shallow decoder is sufficient
Link: https://arxiv.org/pdf/1705.03122.pdf (Gehring et al., 2017)