a learned absolute position embedding is added to the word embeddings to give the model a sense of which part of the input or output sequence it is currently dealing with
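A minimal sketch of this input construction (the vocabulary/position sizes and variable names here are illustrative, not from the paper):

    import torch
    import torch.nn as nn

    vocab_size, max_positions, d = 10000, 1024, 512      # illustrative sizes
    word_emb = nn.Embedding(vocab_size, d)
    pos_emb = nn.Embedding(max_positions, d)              # learned absolute positions

    tokens = torch.randint(0, vocab_size, (2, 20))             # (batch, seq_len)
    positions = torch.arange(tokens.size(1)).unsqueeze(0)      # (1, seq_len), broadcast over batch
    x = word_emb(tokens) + pos_emb(positions)                  # (batch, seq_len, d) input to the network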
Convolutional Block Structure
each convolution kernel takes k input elements, each embedded in d dimensions, and maps them to a single output element of dimension 2d (a sketch of the full block follows at the end of this section)
a Gated Linear Unit (GLU) serves as the gating mechanism: point-wise multiplication of one half of the 2d-dimensional output with a sigmoid gate computed from the other half, producing a d-dimensional output
residual connections are added to enable a deep architecture
in the encoder, padding is added so that the output length matches the input length
in the decoder, masking is applied so that convolutions cannot see future information
linear mappings project between the embedding size f and the convolution outputs of size 2d; such projections are applied when feeding embeddings to the encoder, to the encoder output, to the final decoder layer just before the softmax, and to all decoder layers before computing attention scores
softmax layer
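A minimal sketch of one encoder convolution block as described above (class and variable names are my own, not the paper's; assumes an odd kernel width so same-length padding works out):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGLUBlock(nn.Module):
        def __init__(self, d, k):
            super().__init__()
            # kernel of width k over d-dim inputs, 2d output channels;
            # encoder-side padding keeps output length equal to input length (odd k assumed)
            self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=(k - 1) // 2)

        def forward(self, x):                     # x: (batch, seq_len, d)
            h = self.conv(x.transpose(1, 2))      # (batch, 2d, seq_len)
            h = F.glu(h, dim=1)                   # A * sigmoid(B): gates back down to d channels
            return x + h.transpose(1, 2)          # residual connection

    # a decoder block would instead pad/mask so that no future positions are visible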
Multi-step Attention
a separate attention mechanism is introduced for each decoder layer (multi-hop attention)
for decoder layer l, attention scores are computed as dot products between the decoder state summary d_l and each output z_j of the last encoder block
the input element embedding e_j is added to the weighted sum, i.e. c_i = sum_j a_ij (z_j + e_j), because e_j provides point information about a specific input element that is useful when making a prediction (sketched below)
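A minimal sketch of this per-layer attention (tensor names d_l, z, e follow the notation above; the shapes are my assumption):

    import torch
    import torch.nn.functional as F

    def layer_attention(d_l, z, e):
        # d_l: (batch, tgt_len, d) decoder state summary for layer l
        # z:   (batch, src_len, d) outputs of the last encoder block
        # e:   (batch, src_len, d) input element embeddings
        scores = torch.bmm(d_l, z.transpose(1, 2))   # dot products, (batch, tgt_len, src_len)
        a = F.softmax(scores, dim=-1)                # attention weights over source positions
        c = torch.bmm(a, z + e)                      # weighted sum over z_j + e_j -> conditional input
        return c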
Normalization Strategy
careful scaling of parts of the network is used so that the variance of activations does not change dramatically as signals propagate through the network
the sum of the input and output of a residual block is multiplied by sqrt(0.5) to halve the variance of the sum
the conditional input c generated by attention is a weighted sum of m vectors, so it is scaled by m * sqrt(1/m) to counteract the change in variance (see the sketch after this list)
for decoders with multiple attention mechanisms, the gradients of the encoder layers are scaled by the number of attention mechanisms in use
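A sketch of the two scaling rules above (m is the number of source elements; the function names are mine):

    import math

    def scale_residual(block_input, block_output):
        # sum of a residual block's input and output, scaled to halve the variance of the sum
        return (block_input + block_output) * math.sqrt(0.5)

    def scale_conditional_input(c, m):
        # c is a weighted sum of m attention vectors; multiply by m * sqrt(1/m)
        # to counteract the variance change of the (roughly uniform) weighted sum
        return c * (m * math.sqrt(1.0 / m))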
Initialization
careful weight initialization is used to keep the variance of activations stable in the forward and backward passes
embeddings initialized with N(0, 0.1)
layers whose output is not fed directly into a GLU are initialized from N(0, sqrt(1/n_l)), where n_l is the number of input connections to each neuron
layers feeding into a GLU are initialized from N(0, sqrt(4/n_l)), so that the input to the GLU activation has 4 times the variance of the layer input (the GLU output has roughly a quarter of its input variance)
when dropout is applied with retain probability p, the initialization variance is additionally scaled by p (a sketch follows below)
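A sketch of these initialization rules (the helper name and the feeds_glu / dropout_p arguments are my own; n_in plays the role of n_l):

    import math
    import torch.nn as nn

    def init_weights(module, feeds_glu=False, dropout_p=1.0):
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.1)
        elif isinstance(module, (nn.Linear, nn.Conv1d)):
            n_in = module.weight[0].numel()            # input connections per output unit (n_l)
            variance = (4.0 if feeds_glu else 1.0) / n_in
            variance *= dropout_p                      # scale by retain probability when dropout is used
            nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(variance))
            if module.bias is not None:
                nn.init.zeros_(module.bias)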
Results
SoTA on WMT'16 En-Ro, WMT'14 En-De, and WMT'14 En-Fr
Training
WMT'14 En-Fr
15 encoder/decoder layers (512 x 5, 768 x 4, 1024 x 3, 2048 x 2, 4096 x 1 hidden units)
effective context size of 25 tokens
8 GPUs, batch size 32, 37 days of training
Inference Speed
roughly 9x to 20x faster inference than GNMT on GPU
Ablation Studies
Positional Embeddings : removing position embeddings only slightly hurts accuracy (lower BLEU, higher PPL)
Multi-step attention improves performance with very small overhead (3674 words/sec -> 3772 words/sec)
Encoder/Decoder Layers : a deeper encoder helps; a shallow decoder is sufficient
Link: https://arxiv.org/pdf/1705.03122.pdf (Gehring et al., 2017)