Abstract
Apply key modeling and training techniques from the Transformer to RNNs, yielding a new RNMT+ model that outperforms the existing RNMT, ConvS2S and Transformer models
Hybrid models obtain further improvements
Details
Introduction
NMT models have evolved from RNMT to CNN-based models and now to the Transformer
RNMT
The Google NMT (GNMT) system is based on RNMT
Strong in its sequential nature, with potentially infinite memory
CNN
Faster training and inference than RNMT, thanks to its stacked CNN encoder-decoder
Requires meticulous design of gradient scaling for stable training
Transformer
Current SoTA in NMT
Faster in training and inference, but the decoder has no recurrent memory
Each sub-layer is wrapped as normalize > transform > dropout > residual-add, applied in sequence (sketched below)
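To make that ordering concrete, here is a minimal PyTorch-style sketch of a pre-norm sub-layer wrapper; this is my own illustration rather than code from the paper, and the name PreNormSublayer as well as the feed-forward sizes are hypothetical.

```python
# Minimal sketch of the sub-layer ordering noted above:
# normalize > transform > dropout > residual-add (not code from the paper).
import torch
import torch.nn as nn


class PreNormSublayer(nn.Module):
    """Wraps any transformation (self-attention, feed-forward, ...) as:
    normalize -> transform -> dropout -> residual-add."""

    def __init__(self, d_model: int, transform: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)    # 1. normalize the sub-layer input
        self.transform = transform           # 2. transform (attention, FFN, ...)
        self.dropout = nn.Dropout(dropout)   # 3. dropout on the transformed output
        # 4. the residual-add happens in forward()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.dropout(self.transform(self.norm(x)))


# Usage: wrap a position-wise feed-forward block (hypothetical sizes).
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
sublayer = PreNormSublayer(d_model, ffn)
out = sublayer(torch.randn(8, 20, d_model))  # (batch, length, d_model)
```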
RNMT+
Results
Ablation Experiments
Encoder-Decoder hybrid
Personal Thoughts
Link : https://arxiv.org/pdf/1804.09849v1.pdf Authors : Chen et al. 2018