NMT models are expensive to train, which makes hyper-parameter search difficult
Reports empirical results from several hundred experiments, corresponding to over 250,000 GPU hours, on the WMT En-De translation task
Details
Training
NVIDIA K40m, K80
Distributed training with 8 parallel workers and 6 parameter servers per experiment
Batch size 128; decoding via beam search with beam width 10
2.5M steps per run, replicated 4 times with different initializations
Model checkpoint saved every 30 minutes (setup collected into the config sketch below)
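For reference, the setup above as a plain config dict; a minimal sketch, and every field name here is my own invention rather than anything from the authors' codebase:

```python
# Hypothetical summary of the training setup; field names are mine, not the paper's.
TRAIN_CONFIG = {
    "gpus": ["NVIDIA K40m", "NVIDIA K80"],
    "workers": 8,                   # parallel distributed workers
    "parameter_servers": 6,
    "batch_size": 128,
    "beam_width": 10,               # used at decoding time
    "max_steps": 2_500_000,
    "replicas": 4,                  # re-runs with different initializations
    "checkpoint_interval_min": 30,
}
```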
Baseline Model
2-layer bidirectional encoder with 512-unit GRU cells; multiplicative attention
Dropout of 0.2 applied at the input of each cell
Trained with Adam at a fixed learning rate of 0.0001, no decay (see the sketch below)
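A minimal PyTorch sketch of this baseline, assuming a 32k vocabulary; the class and variable names are mine, not the authors' (the paper's own implementation is TensorFlow-based):

```python
import torch
import torch.nn as nn

EMB_DIM, HIDDEN, LAYERS, DROPOUT = 512, 512, 2, 0.2

class BaselineEncoder(nn.Module):
    def __init__(self, vocab_size=32000):  # vocab size is an assumption
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        self.drop = nn.Dropout(DROPOUT)  # dropout on the first layer's input
        # Note: nn.GRU's dropout argument applies between stacked layers,
        # only approximating the paper's "dropout at the input of each cell".
        self.rnn = nn.GRU(EMB_DIM, HIDDEN, num_layers=LAYERS,
                          dropout=DROPOUT, bidirectional=True, batch_first=True)

    def forward(self, src):               # src: (batch, time) token ids
        x = self.drop(self.embed(src))
        outputs, state = self.rnn(x)      # outputs: (batch, time, 2 * HIDDEN)
        return outputs, state

encoder = BaselineEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # fixed LR, no decay
```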
Embedding Dimensionality
Performance improves with larger embedding dimensions, but the gains are marginal; even 128-dimensional embeddings perform well
RNN Cell Variant
LSTM > GRU
On complex tasks LSTM > GRU, but GRU > LSTM has been observed in parameter-limited settings, since the GRU packs the same hidden size into fewer parameters (see the comparison below)
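A quick way to see the parameter gap: a GRU has three gate blocks to the LSTM's four, so at equal hidden size it needs roughly 3/4 of the parameters. Illustrative only:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=512, hidden_size=512)
gru = nn.GRU(input_size=512, hidden_size=512)
print(f"LSTM: {n_params(lstm):,}  GRU: {n_params(gru):,}")  # ratio is ~4:3
```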
Encoder - Decoder Depth
Models that are too deep become difficult to train
Depth 4 is adequate; residual and dense-residual connections allow faster training but give no big performance bump (sketch below)
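A minimal sketch (my own, not the authors' code) of plain residual connections between stacked recurrent layers; a dense-residual variant would instead add the outputs of all earlier layers at each step:

```python
import torch.nn as nn

class ResidualGRUStack(nn.Module):
    def __init__(self, dim=512, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(depth)])

    def forward(self, x):        # x: (batch, time, dim)
        for layer in self.layers:
            out, _ = layer(x)
            x = x + out          # residual: add the layer's input to its output
        return x
```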
UniDirectional vs BiDirectional Encoder
Bidirectional encoders perform better, but are slower because the backward pass makes parallelization harder (see below)
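Illustrative only: the bidirectional variant adds a right-to-left pass that needs the whole sentence before later layers can run, which is the parallelization cost; note the doubled output width as well:

```python
import torch.nn as nn

uni  = nn.GRU(512, 512, batch_first=True)                      # output dim 512
bidi = nn.GRU(512, 512, batch_first=True, bidirectional=True)  # output dim 1024
```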
Attention Mechanism
Additive attention outperforms the other variants tested, perhaps benefiting from a residual-like effect? (both scoring functions are sketched below)
Attention also trains much faster than models without it, suggesting that attention acts more like a 'weighted skip connection' that optimizes gradient flow than like a 'memory' that lets the decoder access source states, as commonly stated in the literature
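A sketch of the two scoring functions being compared, with shapes and names of my own choosing (query = current decoder state, keys = encoder outputs):

```python
import torch
import torch.nn as nn

class AttentionScores(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)  # additive: query projection
        self.W_k = nn.Linear(dim, dim, bias=False)  # additive: key projection
        self.v = nn.Linear(dim, 1, bias=False)      # additive: scoring vector
        self.W_m = nn.Linear(dim, dim, bias=False)  # multiplicative: bilinear map

    def additive(self, query, keys):
        # score(q, k) = v^T tanh(W_q q + W_k k); query: (B, D), keys: (B, T, D)
        h = torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys))
        return self.v(h).squeeze(-1)                # (B, T) scores

    def multiplicative(self, query, keys):
        # score(q, k) = q^T W_m k
        return torch.einsum("bd,btd->bt", self.W_m(query), keys)
```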
Beam Search Strategy
Large beams combined with a length penalty perform best (length-normalization sketch below)
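The point of the length penalty: summed log-probabilities always decrease as a hypothesis grows, so an unpenalized beam prefers short outputs. The GNMT-style normalization below is one common choice, not necessarily the paper's exact formula:

```python
def length_penalty(length: int, alpha: float = 1.0) -> float:
    # GNMT-style penalty; alpha = 0 disables it, alpha = 1 is close to
    # dividing by the hypothesis length.
    return ((5.0 + length) / 6.0) ** alpha

def hypothesis_score(sum_log_prob: float, length: int, alpha: float = 1.0) -> float:
    # Rank beam hypotheses by normalized rather than raw log-probability.
    return sum_log_prob / length_penalty(length, alpha)
```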
Best optimized model
Combines the best-performing settings from the experiments above and reports the resulting model's performance
Link: https://arxiv.org/pdf/1703.03906.pdf
Authors: Britz et al., 2017