Empirical results on training NMT in a large-scale e-commerce setting at Booking.com
Covers optimization, training, and evaluation
Details
Model Architecture
4-layer LSTM written in Lua (a rough PyTorch sketch of the architecture follows this list)
Use global attention
Use "case" embedding feature
residual connections with dropout 0.3
batch size not reported
Handles named entities by pre-processing the input: NE tags are detected in both source and target sentences, replaced with a placeholder, and the original surface form is copied back through via the attention map (sketched below)
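As a rough illustration of the architecture bullets above, here is a minimal sketch assuming PyTorch (the paper's actual implementation is in Lua/Torch); the vocabulary, embedding, and hidden sizes and the 4-way case vocabulary are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, num_cases=4, emb=500, case_emb=12, hidden=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb)
        # "case" feature: e.g. lowercase / Capitalized / UPPER / other
        self.case_emb = nn.Embedding(num_cases, case_emb)
        # 4 LSTM layers with dropout 0.3 between layers; the residual
        # connections noted above would need a custom layer stack and
        # are omitted here for brevity
        self.rnn = nn.LSTM(emb + case_emb, hidden, num_layers=4,
                           dropout=0.3, batch_first=True)

    def forward(self, tokens, cases):
        x = torch.cat([self.tok_emb(tokens), self.case_emb(cases)], dim=-1)
        return self.rnn(x)  # (all encoder states, final (h, c))

class GlobalAttention(nn.Module):
    """Luong-style global (dot) attention over all encoder states."""
    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, H); enc_outputs: (B, T, H)
        scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=-1)  # the attention map
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights
```

And a hedged sketch of the named-entity placeholder scheme: the regex detector and the German output string are toy stand-ins; the paper relies on an actual NE tagger and on the attention map to align each placeholder during decoding.

```python
import re

PLACEHOLDER = "__NE_{}__"

def replace_entities(sentence, pattern=r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b"):
    """Swap detected entities for indexed placeholders before translation."""
    entities = []
    def _sub(match):
        entities.append(match.group(0))
        return PLACEHOLDER.format(len(entities) - 1)
    return re.sub(pattern, _sub, sentence), entities

def restore_entities(translation, entities):
    """Copy the original surface forms back into the translated output."""
    for i, entity in enumerate(entities):
        translation = translation.replace(PLACEHOLDER.format(i), entity)
    return translation

src, ents = replace_entities("Hotel Adlon is near Brandenburg Gate")
# src == "__NE_0__ is near __NE_1__"  -> translate this with the NMT model
out = restore_entities("__NE_0__ liegt nahe __NE_1__", ents)
```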
Optimizer
1M En-De dataset
SGD vs Adam vs Adagrad vs Adadelta (learning rates 1.0, 0.0002, 0.1, and 1.0 respectively; a configuration sketch follows this list)
SGD performs best
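A minimal sketch of that optimizer comparison, assuming PyTorch (the experiments themselves ran in Lua/Torch); the Linear model is a stand-in for the NMT network, and only the four learning rates come from the notes above.

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the NMT network
optimizers = {
    "sgd":      torch.optim.SGD(model.parameters(), lr=1.0),
    "adam":     torch.optim.Adam(model.parameters(), lr=0.0002),
    "adagrad":  torch.optim.Adagrad(model.parameters(), lr=0.1),
    "adadelta": torch.optim.Adadelta(model.parameters(), lr=1.0),
}
# train one run per optimizer on the 1M En-De set and compare;
# in the paper's experiments plain SGD came out on top
```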
Multi-GPU
Async vs Sync Multi-GPU
single GPU performs best, which is the opposite of our in-house result (a toy contrast of the two schemes is sketched below)
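A toy sketch of the two update schemes, assuming PyTorch on CPU with sequential "workers" standing in for GPUs; the real setup is multi-GPU training of the full NMT model.

```python
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
master = torch.nn.Linear(8, 1)  # stand-in for the NMT network
opt = torch.optim.SGD(master.parameters(), lr=0.1)

def sync_update(batches):
    """Synchronous: every replica computes gradients on its own batch,
    gradients are averaged, then the parameters are stepped once."""
    replicas = [copy.deepcopy(master) for _ in batches]
    for rep, (x, y) in zip(replicas, batches):
        F.mse_loss(rep(x), y).backward()
    for p, *reps in zip(master.parameters(),
                        *[r.parameters() for r in replicas]):
        p.grad = torch.stack([rp.grad for rp in reps]).mean(dim=0)
    opt.step()
    opt.zero_grad()

def async_update(batches):
    """Asynchronous (Hogwild-style): each worker steps the shared
    parameters as soon as its gradient is ready, with no barrier.
    Run sequentially here as a stand-in for concurrent workers."""
    for x, y in batches:
        opt.zero_grad()
        F.mse_loss(master(x), y).backward()
        opt.step()

batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(2)]
sync_update(batches)
async_update(batches)
```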
Corpus Size
corpora of 1M, 2.5M, 5M, 7.5M, and 10M sentence pairs, each run for 90M iterations
10M performs best overall, with higher human-evaluation scores that are not fully reflected in BLEU (the more data, the better)
Evaluation
Human evaluation with Adequacy + Fluency metrics (a toy score-aggregation sketch follows)
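A toy sketch of aggregating such judgments, assuming a 1-4 rating scale per sentence; the scale and the data are illustrative assumptions, not values from the paper.

```python
from statistics import mean

# illustrative annotator judgments: (adequacy, fluency) per sentence
ratings = [(4, 4), (3, 4), (4, 3), (2, 3)]
adequacy = mean(a for a, _ in ratings)
fluency = mean(f for _, f in ratings)
print(f"adequacy={adequacy:.2f} fluency={fluency:.2f}")
```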
Personal Thoughts
Solid work and experiments on NMT
Their in-house data seems abundant and strong
Good to see that they openly publish their results
Link : https://arxiv.org/pdf/1709.05820.pdf Authors : Levin et al. 2017