Abstract
1 Introduction
2 Background
3 Model Architecture
3.1 Encoder and Decoder Stacks
Encoder : a stack of N = 6 identical layers, each with a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, with a residual connection and layer normalization around each sub-layer ($LayerNorm(x + Sublayer(x))$)
Decoder : also a stack of N = 6 identical layers, with an additional encoder-decoder attention sub-layer over the encoder output and masked self-attention
3.2 Attention
an attention function maps a query and a set of key-value pairs to an output
the output is computed as a weighted sum of the values
3.2.1 Scaled Dot-Product Attention
$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
compute the dot products of the query with all keys
divide each by $\sqrt{d_k}$
apply a softmax function to obtain the weights on the values
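A minimal NumPy sketch of scaled dot-product attention as written above; the function name, argument shapes, and the boolean `mask` convention are illustrative assumptions, not part of the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v).
    mask: optional boolean array, True where attention is allowed.
    """
    d_k = Q.shape[-1]
    # Dot products of the query with all keys, scaled by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score before the softmax.
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key axis gives the weights on the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example shapes: queries of length 5 attending over 7 key-value pairs.
out = scaled_dot_product_attention(np.random.randn(2, 5, 64),
                                   np.random.randn(2, 7, 64),
                                   np.random.randn(2, 7, 64))
print(out.shape)   # (2, 5, 64)
```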
3.2.2 Multi-Head Attention
h attention heads run in parallel on linearly projected queries, keys, and values
allows the model to jointly attend to information from different representation subspaces at different positions
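A sketch of multi-head attention under the paper's setup ($h$ heads, each of size $d_k = d_{model}/h$), reusing the `scaled_dot_product_attention` sketch above; the `weights` dictionary of projection matrices is a placeholder for learned parameters.

```python
import numpy as np

def multi_head_attention(Q, K, V, weights, h=8):
    """Project Q, K, V h times, attend in parallel, concatenate, and project back.

    Q, K, V: (batch, seq, d_model)
    weights: dict of (d_model, d_model) projection matrices 'W_q', 'W_k', 'W_v', 'W_o'.
    """
    batch, seq_q, d_model = Q.shape
    d_k = d_model // h                      # base model: 512 / 8 = 64 per head

    def split_heads(x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, s, _ = x.shape
        return x.reshape(b, s, h, d_k).transpose(0, 2, 1, 3)

    q = split_heads(Q @ weights['W_q'])
    k = split_heads(K @ weights['W_k'])
    v = split_heads(V @ weights['W_v'])
    # Each head attends in parallel over its own learned subspace.
    heads = scaled_dot_product_attention(q, k, v)           # (batch, h, seq_q, d_k)
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_q, d_model)
    return concat @ weights['W_o']
```

In the base model $h = 8$ and $d_{model} = 512$, so each head works with $d_k = d_v = 64$ dimensions.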
3.2.3 Applications of Attention in our Model
encoder-decoder attention layers : queries come from the previous decoder layer, keys and values from the encoder output
self-attention layers in the encoder : each position attends to all positions in the previous layer
self-attention layers in the decoder : masking (prevents leftward information flow, preserving the auto-regressive property)
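A sketch of the decoder-side mask: a lower-triangular boolean matrix that, plugged into the attention sketch above as `mask`, lets position $i$ attend only to positions $\le i$.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular boolean mask: True where attention is allowed, so
    # position i can only see positions j <= i (no information from the future).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```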
3.3 Position-wise Feed-Forward Networks
$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$
two linear transformations with a ReLU activation in between, applied to each position separately and identically
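A sketch of the position-wise feed-forward network; in the paper $d_{model} = 512$ and the inner dimension $d_{ff} = 2048$, and the weight arguments here stand in for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position.

    x: (batch, seq, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation
```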
3.4 Embeddings and Softmax
learned embeddings convert the input tokens and output tokens to vectors of dimension $d_{model}$
a learned linear transformation and a softmax function convert the decoder output to predicted next-token probabilities
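A sketch of the embedding and output layers, assuming the paper's weight sharing between the two embedding matrices and the pre-softmax linear transformation; the random `embedding` matrix is a placeholder for learned weights, and the vocabulary size stands for the roughly 37,000-token shared BPE vocabulary used for English-German.

```python
import numpy as np

d_model, vocab_size = 512, 37000            # base model size and ~37k shared BPE vocabulary
embedding = 0.01 * np.random.randn(vocab_size, d_model)   # placeholder for learned weights

def embed(token_ids):
    # Look up each token and scale by sqrt(d_model), as the paper does.
    return embedding[token_ids] * np.sqrt(d_model)

def next_token_probabilities(decoder_output):
    # Linear transformation (sharing the embedding weights) followed by a softmax.
    logits = decoder_output @ embedding.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)
```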
3.5 Positional Encoding
positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks
sine and cosine functions of different frequencies
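A sketch of the sinusoidal positional encoding, $PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}})$; `max_len` is an illustrative cap, and the usage comment reuses names from the sketches above.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# Added to the embeddings before the first layer, e.g.:
# x = embed(token_ids) + positional_encoding(len(token_ids), d_model)
```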
4 Why Self-Attention
the total computational complexity per layer
the amount of computation that can be parallelized (minimum number of sequential operations)
the path length between long-range dependencies in the network
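The paper's Table 1 gives per-layer complexity $O(n^2 \cdot d)$ for self-attention versus $O(n \cdot d^2)$ for a recurrent layer, so self-attention is cheaper whenever the sequence length $n$ is smaller than the representation size $d$; a tiny worked comparison below, where the sentence length $n = 70$ is an assumed illustrative value.

```python
# Rough per-layer operation counts following Table 1 of the paper.
n, d = 70, 512                     # n: assumed sentence length, d: d_model in the base model
print("self-attention O(n^2 * d):", n * n * d)   # 2,508,800
print("recurrent      O(n * d^2):", n * d * d)   # 18,350,080
```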
5 Training
5.1 Training Data and Batching
WMT 2014 English-German dataset
5.2 Hardware and Schedule
8 NVIDIA P100 GPUs
5.3 Optimizer
Adam optimizer
varied the learning rate over the course of training : linear warmup followed by inverse-square-root decay
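The paper uses Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ and the schedule $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$ with $warmup\_steps = 4000$; a small sketch of that schedule (the function name is illustrative).

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
    # Linear warmup for the first warmup_steps, then decay proportional to 1/sqrt(step).
    step = max(step, 1)                    # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(4000))   # peak learning rate, about 7.0e-4
```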
5.4 Regularization
Residual Dropout
Label Smoothing
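The paper applies label smoothing with $\epsilon_{ls} = 0.1$ (and residual dropout $P_{drop} = 0.1$ for the base model); below is a sketch of one common formulation of smoothed targets that spreads $\epsilon$ uniformly over the non-target vocabulary entries, which is an assumption about the exact distribution rather than a detail stated in the paper.

```python
import numpy as np

def smoothed_targets(token_ids, vocab_size, eps=0.1):
    # Replace the one-hot target with (1 - eps) on the correct token and spread
    # eps uniformly over the remaining vocabulary entries.
    token_ids = np.asarray(token_ids)
    targets = np.full((len(token_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(token_ids)), token_ids] = 1.0 - eps
    return targets

# Each row sums to 1, e.g. for token 2 in a 4-word vocabulary:
# approximately [0.0333, 0.0333, 0.9, 0.0333]
print(smoothed_targets([2], vocab_size=4, eps=0.1))
```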
6 Results
6.1 Machine Translation
WMT 2014 English-to-German translation task : BLEU score of 28.4
WMT 2014 English-to-French translation task : BLEU score of 41.0
6.2 Model Variations
6.3 English Constituency Parsing
7 Conclusion
Paper URL
https://arxiv.org/abs/1706.03762