NMT models are expensive to train, which makes hyper-parameter search difficult
Reports empirical results from several hundred experiments, corresponding to over 250,000 GPU hours, on the WMT En-De translation task
Details
Training
NVIDIA K40m, K80
Distributed training with 8 parallel workers and 6 parameter servers per experiment
Batch size 128; decoding via beam search with beam width 10
2.5M steps per run, replicated 4 times with different initializations
Model checkpoint saved every 30 minutes (setup collected into the config sketch below)
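For reference, the setup above as a plain config dict; a minimal sketch, and every field name here is my own invention rather than anything from the authors' codebase:

```python
# Hypothetical summary of the training setup; field names are mine, not the paper's.
TRAIN_CONFIG = {
    "gpus": ["NVIDIA K40m", "NVIDIA K80"],
    "workers": 8,                   # parallel distributed workers
    "parameter_servers": 6,
    "batch_size": 128,
    "beam_width": 10,               # used at decoding time
    "max_steps": 2_500_000,
    "replicas": 4,                  # re-runs with different initializations
    "checkpoint_interval_min": 30,
}
```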
Baseline Model
2-layer bidirectional encoder with 512-unit GRU cells; multiplicative attention
Dropout of 0.2 applied at the input of each cell
Trained with Adam at a fixed learning rate of 0.0001, no decay (see the sketch below)
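A minimal PyTorch sketch of this baseline, assuming a 32k vocabulary; the class and variable names are mine, not the authors' (the paper's own implementation is TensorFlow-based):

```python
import torch
import torch.nn as nn

EMB_DIM, HIDDEN, LAYERS, DROPOUT = 512, 512, 2, 0.2

class BaselineEncoder(nn.Module):
    def __init__(self, vocab_size=32000):  # vocab size is an assumption
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        self.drop = nn.Dropout(DROPOUT)  # dropout on the first layer's input
        # Note: nn.GRU's dropout argument applies between stacked layers,
        # only approximating the paper's "dropout at the input of each cell".
        self.rnn = nn.GRU(EMB_DIM, HIDDEN, num_layers=LAYERS,
                          dropout=DROPOUT, bidirectional=True, batch_first=True)

    def forward(self, src):               # src: (batch, time) token ids
        x = self.drop(self.embed(src))
        outputs, state = self.rnn(x)      # outputs: (batch, time, 2 * HIDDEN)
        return outputs, state

encoder = BaselineEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # fixed LR, no decay
```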
Embedding Dimensionality
Performance improves with larger embedding dimensions, but the gains are marginal; even 128-dimensional embeddings perform well
RNN Cell Variant
LSTM > GRU
On complex tasks LSTM > GRU, but GRU > LSTM has been observed in parameter-limited settings, since the GRU packs the same hidden size into fewer parameters (see the comparison below)
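A quick way to see the parameter gap: a GRU has three gate blocks to the LSTM's four, so at equal hidden size it needs roughly 3/4 of the parameters. Illustrative only:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=512, hidden_size=512)
gru = nn.GRU(input_size=512, hidden_size=512)
print(f"LSTM: {n_params(lstm):,}  GRU: {n_params(gru):,}")  # ratio is ~4:3
```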
Encoder - Decoder Depth
Models that are too deep become difficult to train
Depth 4 is adequate; residual and dense-residual connections allow faster training but give no big performance bump (sketch below)
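A minimal sketch (my own, not the authors' code) of plain residual connections between stacked recurrent layers; a dense-residual variant would instead add the outputs of all earlier layers at each step:

```python
import torch.nn as nn

class ResidualGRUStack(nn.Module):
    def __init__(self, dim=512, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(depth)])

    def forward(self, x):        # x: (batch, time, dim)
        for layer in self.layers:
            out, _ = layer(x)
            x = x + out          # residual: add the layer's input to its output
        return x
```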
UniDirectional vs BiDirectional Encoder
Bidirectional encoders perform better, but are slower because the backward pass makes parallelization harder (see below)
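Illustrative only: the bidirectional variant adds a right-to-left pass that needs the whole sentence before later layers can run, which is the parallelization cost; note the doubled output width as well:

```python
import torch.nn as nn

uni  = nn.GRU(512, 512, batch_first=True)                      # output dim 512
bidi = nn.GRU(512, 512, batch_first=True, bidirectional=True)  # output dim 1024
```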
Attention Mechanism
Additive attention outperforms the other variants tested, perhaps benefiting from a residual-like effect? (both scoring functions are sketched below)
Attention also trains much faster than models without it, suggesting that attention acts more like a 'weighted skip connection' that optimizes gradient flow than like a 'memory' that lets the decoder access source states, as commonly stated in the literature
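A sketch of the two scoring functions being compared, with shapes and names of my own choosing (query = current decoder state, keys = encoder outputs):

```python
import torch
import torch.nn as nn

class AttentionScores(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)  # additive: query projection
        self.W_k = nn.Linear(dim, dim, bias=False)  # additive: key projection
        self.v = nn.Linear(dim, 1, bias=False)      # additive: scoring vector
        self.W_m = nn.Linear(dim, dim, bias=False)  # multiplicative: bilinear map

    def additive(self, query, keys):
        # score(q, k) = v^T tanh(W_q q + W_k k); query: (B, D), keys: (B, T, D)
        h = torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys))
        return self.v(h).squeeze(-1)                # (B, T) scores

    def multiplicative(self, query, keys):
        # score(q, k) = q^T W_m k
        return torch.einsum("bd,btd->bt", self.W_m(query), keys)
```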
Beam Search Strategy
Large beams combined with a length penalty perform best (length-normalization sketch below)
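The point of the length penalty: summed log-probabilities always decrease as a hypothesis grows, so an unpenalized beam prefers short outputs. The GNMT-style normalization below is one common choice, not necessarily the paper's exact formula:

```python
def length_penalty(length: int, alpha: float = 1.0) -> float:
    # GNMT-style penalty; alpha = 0 disables it, alpha = 1 is close to
    # dividing by the hypothesis length.
    return ((5.0 + length) / 6.0) ** alpha

def hypothesis_score(sum_log_prob: float, length: int, alpha: float = 1.0) -> float:
    # Rank beam hypotheses by normalized rather than raw log-probability.
    return sum_log_prob / length_penalty(length, alpha)
```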
Best optimized model
Combines the best-performing settings from the experiments above and reports the resulting model's performance
Link: https://arxiv.org/pdf/1703.03906.pdf
Authors: Britz et al., 2017