Abstract
1 Introduction
2 Background
3 Model Architecture
3.1 Encoder and Decoder Stacks
Encoder : a stack of N = 6 identical layers, each with a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, with a residual connection and layer normalization around each sub-layer ($LayerNorm(x + Sublayer(x))$)
Decoder : also a stack of N = 6 identical layers, with an additional encoder-decoder attention sub-layer over the encoder output and masked self-attention
3.2 Attention
an attention function maps a query and a set of key-value pairs to an output
the output is computed as a weighted sum of the values
3.2.1 Scaled Dot-Product Attention
$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
compute the dot products of the query with all keys
divide each by $\sqrt{d_k}$
apply a softmax function to obtain the weights on the values
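A minimal NumPy sketch of scaled dot-product attention as written above; the function name, argument shapes, and the boolean `mask` convention are illustrative assumptions, not part of the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v).
    mask: optional boolean array, True where attention is allowed.
    """
    d_k = Q.shape[-1]
    # Dot products of the query with all keys, scaled by sqrt(d_k).
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score before the softmax.
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key axis gives the weights on the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example shapes: queries of length 5 attending over 7 key-value pairs.
out = scaled_dot_product_attention(np.random.randn(2, 5, 64),
                                   np.random.randn(2, 7, 64),
                                   np.random.randn(2, 7, 64))
print(out.shape)   # (2, 5, 64)
```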
3.2.2 Multi-Head Attention
h attention heads run in parallel on linearly projected queries, keys, and values
allows the model to jointly attend to information from different representation subspaces at different positions
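A sketch of multi-head attention under the paper's setup ($h$ heads, each of size $d_k = d_{model}/h$), reusing the `scaled_dot_product_attention` sketch above; the `weights` dictionary of projection matrices is a placeholder for learned parameters.

```python
import numpy as np

def multi_head_attention(Q, K, V, weights, h=8):
    """Project Q, K, V h times, attend in parallel, concatenate, and project back.

    Q, K, V: (batch, seq, d_model)
    weights: dict of (d_model, d_model) projection matrices 'W_q', 'W_k', 'W_v', 'W_o'.
    """
    batch, seq_q, d_model = Q.shape
    d_k = d_model // h                      # base model: 512 / 8 = 64 per head

    def split_heads(x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, s, _ = x.shape
        return x.reshape(b, s, h, d_k).transpose(0, 2, 1, 3)

    q = split_heads(Q @ weights['W_q'])
    k = split_heads(K @ weights['W_k'])
    v = split_heads(V @ weights['W_v'])
    # Each head attends in parallel over its own learned subspace.
    heads = scaled_dot_product_attention(q, k, v)           # (batch, h, seq_q, d_k)
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_q, d_model)
    return concat @ weights['W_o']
```

In the base model $h = 8$ and $d_{model} = 512$, so each head works with $d_k = d_v = 64$ dimensions.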
3.2.3 Applications of Attention in our Model
encoder-decoder attention layers : queries come from the previous decoder layer, keys and values from the encoder output
self-attention layers in the encoder : each position attends to all positions in the previous layer
self-attention layers in the decoder : masking (prevents leftward information flow, preserving the auto-regressive property)
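A sketch of the decoder-side mask: a lower-triangular boolean matrix that, plugged into the attention sketch above as `mask`, lets position $i$ attend only to positions $\le i$.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular boolean mask: True where attention is allowed, so
    # position i can only see positions j <= i (no information from the future).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```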
3.3 Position-wise Feed-Forward Networks
$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$
two linear transformations with a ReLU activation in between, applied to each position separately and identically
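A sketch of the position-wise feed-forward network; in the paper $d_{model} = 512$ and the inner dimension $d_{ff} = 2048$, and the weight arguments here stand in for learned parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position.

    x: (batch, seq, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation
```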
3.4 Embeddings and Softmax
learned embeddings convert the input tokens and output tokens to vectors of dimension $d_{model}$
a learned linear transformation and a softmax function convert the decoder output to predicted next-token probabilities
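A sketch of the embedding and output layers, assuming the paper's weight sharing between the two embedding matrices and the pre-softmax linear transformation; the random `embedding` matrix is a placeholder for learned weights, and the vocabulary size stands for the roughly 37,000-token shared BPE vocabulary used for English-German.

```python
import numpy as np

d_model, vocab_size = 512, 37000            # base model size and ~37k shared BPE vocabulary
embedding = 0.01 * np.random.randn(vocab_size, d_model)   # placeholder for learned weights

def embed(token_ids):
    # Look up each token and scale by sqrt(d_model), as the paper does.
    return embedding[token_ids] * np.sqrt(d_model)

def next_token_probabilities(decoder_output):
    # Linear transformation (sharing the embedding weights) followed by a softmax.
    logits = decoder_output @ embedding.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)
```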
3.5 Positional Encoding
positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks
sine and cosine functions of different frequencies
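A sketch of the sinusoidal positional encoding, $PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}})$; `max_len` is an illustrative cap, and the usage comment reuses names from the sketches above.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

# Added to the embeddings before the first layer, e.g.:
# x = embed(token_ids) + positional_encoding(len(token_ids), d_model)
```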
4 Why Self-Attention
the total computational complexity per layer
the amount of computation that can be parallelized (minimum number of sequential operations)
the path length between long-range dependencies in the network
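The paper's Table 1 gives per-layer complexity $O(n^2 \cdot d)$ for self-attention versus $O(n \cdot d^2)$ for a recurrent layer, so self-attention is cheaper whenever the sequence length $n$ is smaller than the representation size $d$; a tiny worked comparison below, where the sentence length $n = 70$ is an assumed illustrative value.

```python
# Rough per-layer operation counts following Table 1 of the paper.
n, d = 70, 512                     # n: assumed sentence length, d: d_model in the base model
print("self-attention O(n^2 * d):", n * n * d)   # 2,508,800
print("recurrent      O(n * d^2):", n * d * d)   # 18,350,080
```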
5 Training
5.1 Training Data and Batching
WMT 2014 English-German dataset
5.2 Hardware and Schedule
8 NVIDIA P100 GPUs
5.3 Optimizer
Adam optimizer
varied the learning rate over the course of training : linear warmup followed by inverse-square-root decay
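The paper uses Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ and the schedule $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$ with $warmup\_steps = 4000$; a small sketch of that schedule (the function name is illustrative).

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
    # Linear warmup for the first warmup_steps, then decay proportional to 1/sqrt(step).
    step = max(step, 1)                    # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(4000))   # peak learning rate, about 7.0e-4
```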
5.4 Regularization
Residual Dropout
Label Smoothing
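The paper applies label smoothing with $\epsilon_{ls} = 0.1$ (and residual dropout $P_{drop} = 0.1$ for the base model); below is a sketch of one common formulation of smoothed targets that spreads $\epsilon$ uniformly over the non-target vocabulary entries, which is an assumption about the exact distribution rather than a detail stated in the paper.

```python
import numpy as np

def smoothed_targets(token_ids, vocab_size, eps=0.1):
    # Replace the one-hot target with (1 - eps) on the correct token and spread
    # eps uniformly over the remaining vocabulary entries.
    token_ids = np.asarray(token_ids)
    targets = np.full((len(token_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(token_ids)), token_ids] = 1.0 - eps
    return targets

# Each row sums to 1, e.g. for token 2 in a 4-word vocabulary:
# approximately [0.0333, 0.0333, 0.9, 0.0333]
print(smoothed_targets([2], vocab_size=4, eps=0.1))
```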
6 Results
6.1 Machine Translation
WMT 2014 English-to-German translation task : BLEU score of 28.4
WMT 2014 English-to-French translation task : BLEU score of 41.0
6.2 Model Variations
6.3 English Constituency Parsing
7 Conclusion
Paper URL
https://arxiv.org/abs/1706.03762