Technical report in which the FAIR team analyzes three classic problems in NMT
performance degradation with large beams
under-estimation of rare words
lack of diversity in final translation
The report relates these issues to two kinds of uncertainty in NMT
intrinsic (inherent) uncertainty, due to the existence of multiple valid translations for a single source
extrinsic uncertainty, due to noisy training data
Details
Datasets
WMT14 EnDe : 4.5M sentences, 40k BPE
WMT17 EnDe : 5.9M sentences
WMT14 EnFr : 35.5M sentences, 40k BPE
Performance : human evaluation indicates that the EnFr model is well trained
Model Output Distribution
The left panel of Figure 1 shows that even after drawing 10k samples, the hypotheses cover only 24.9% of the sequence-level probability mass
The center panel of Figure 1 shows that BLEU and model probability are imperfectly correlated: the average token probability of the max-BLEU sampling candidates plateaus after 100 hypotheses
The right panel of Figure 1 again shows that BLEU and model probability are imperfectly correlated
Figure 2 shows that beam search chooses tokens with much higher probability than the reference or sampling does, i.e. the model has its own bias
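The coverage number in the left panel can be read as the summed model probability of the distinct hypotheses seen so far. Below is a minimal sketch of that bookkeeping, not the authors' code. The function name `coverage_curve` and the `(hypothesis, logprob)` input format are assumptions made for illustration.

```python
import math


def coverage_curve(samples):
    """Cumulative sequence-level probability mass covered by distinct samples.

    samples: list of (hypothesis, sequence_logprob) pairs in draw order
             (hypothetical input format, assumed for this sketch).
    Returns a list whose i-th entry is the mass covered after i+1 draws.
    """
    seen = set()
    total = 0.0
    curve = []
    for hyp, logprob in samples:
        # A repeated hypothesis adds no new mass; only distinct ones count.
        if hyp not in seen:
            seen.add(hyp)
            total += math.exp(logprob)
        curve.append(total)
    return curve


# Toy usage: three draws, the third repeats the first hypothesis.
toy = [("a", math.log(0.1)), ("b", math.log(0.05)), ("a", math.log(0.1))]
curve = coverage_curve(toy)
```

A curve that flattens well below 1.0, as in the 24.9% figure above, means the model spreads its probability over a very large number of hypotheses.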
Performance Degradation with Large Beams
Copy noise, i.e. target sentences in the training corpus that are exact copies of the source, is an example of data noise where even a small amount can degrade beam search at large beam sizes
Although a copied sentence has low token probability in the initial steps of beam search, once the model starts copying its confidence becomes very high (there is no option other than to continue copying), so the cumulative log-probability ends up high by the end of beam search
When copy noise is filtered out, beam search performs comparatively well at large beam sizes
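One way to filter copy noise is to drop training pairs whose target shares most of its tokens with the source. The sketch below illustrates that idea and is not necessarily the authors' exact filter. The whitespace tokenization, the helper names and the 0.5 threshold are all assumptions.

```python
def unigram_overlap(source, target):
    """Fraction of target tokens that also occur in the source."""
    src_tokens = set(source.split())
    tgt_tokens = target.split()
    if not tgt_tokens:
        return 0.0
    return sum(tok in src_tokens for tok in tgt_tokens) / len(tgt_tokens)


def filter_copies(pairs, threshold=0.5):
    """Keep (source, target) pairs whose overlap does not exceed the threshold.

    threshold=0.5 is an assumed value for this sketch, not a tuned setting.
    """
    return [(s, t) for s, t in pairs if unigram_overlap(s, t) <= threshold]


# Toy usage: the second pair is an exact copy of the source and is removed.
pairs = [
    ("the cat sat", "die Katze saß"),
    ("the cat sat", "the cat sat"),
]
clean = filter_copies(pairs)
```

A token-overlap check also catches near copies, which an exact string comparison would miss. The cost is that legitimate pairs sharing many names or numbers can be dropped too.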
Under-Estimation of Rare Words
Beam search and sampling (the model) under-represent rare words and over-represent frequent words
Setup: a word w is replaced with w1 or w2, choosing w1 with probability p(w1 | w), and the model is trained on the modified data. The trained model is then analyzed to see whether it can estimate the replacement rate that determines the frequency of w1 and w2.
Sampling estimates the rate well, so the frequency statistics are captured by the model itself
Beam search over-estimates the frequent word and under-estimates the rare one, i.e. beam search prefers common alternatives
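The synthetic replacement setup can be sketched as follows: corrupt the data with a known rate, then measure the rate back from any set of outputs. This is an illustrative sketch only. The function names and the token-list data format are assumptions, and the measurement is shown on the corrupted data itself rather than on real model outputs.

```python
import random


def replace_word(tokens, w, w1, w2, p_w1, rng):
    """Replace every occurrence of w with w1 (probability p_w1) or w2 (otherwise)."""
    return [
        (w1 if rng.random() < p_w1 else w2) if tok == w else tok
        for tok in tokens
    ]


def replacement_rate(corpus, w1, w2):
    """Empirical share of w1 among all occurrences of w1 and w2.

    corpus: list of token lists, e.g. training targets, samples or beam outputs.
    """
    n1 = sum(sent.count(w1) for sent in corpus)
    n2 = sum(sent.count(w2) for sent in corpus)
    return n1 / (n1 + n2) if (n1 + n2) else 0.0


# Toy usage: corrupt 1000 occurrences of "w" with p(w1 | w) = 0.9.
rng = random.Random(0)
corrupted = [replace_word(["w"] * 1000, "w", "w1", "w2", 0.9, rng)]
rate = replacement_rate(corrupted, "w1", "w2")
```

Applying `replacement_rate` to sampled outputs and to beam outputs, then comparing both against the true `p_w1`, is the kind of comparison the notes above describe.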
Lack of Diversity in Final Translation
Setting
10 distinct human translators each produced a reference translation for 500 test sentences, giving 10 references per sentence
oracle reference denotes the best BLEU between the top hypothesis and the 10 human translations
average oracle denotes the average BLEU between the k hypotheses and the 10 human translations
Interpretation
k=5 beam has high oracle reference and high average oracle - meaning most hypotheses are accurate and close to human translations, but the number of refs covered is low - meaning diversity is low
k=200 beam shows a similar tendency with reduced BLEU and a slight rise in diversity
sampling k=200 shows the opposite : average oracle is low - meaning several hypotheses match the 10 references poorly, but the number of refs covered is high - meaning coverage and diversity are high
hence, the model distribution is too concentrated in hypothesis space and does not cover diverse translations
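The three quantities used above can be sketched in a few lines. This sketch makes two assumptions. First, a toy token-set overlap (`token_overlap`) stands in for sentence-level BLEU. Second, "average oracle" is read as each hypothesis scored against its best-matching reference, and "refs covered" as the number of distinct references that are the best match of at least one hypothesis. All function names are hypothetical.

```python
def token_overlap(hyp, ref):
    """Toy stand-in for sentence-level BLEU: Jaccard overlap of token sets."""
    h, r = set(hyp.split()), set(ref.split())
    union = h | r
    return len(h & r) / len(union) if union else 0.0


def oracle_reference(top_hyp, refs, score=token_overlap):
    """Best score of the top hypothesis over all references."""
    return max(score(top_hyp, r) for r in refs)


def average_oracle(hyps, refs, score=token_overlap):
    """Mean over hypotheses of each hypothesis's best score over the references."""
    return sum(max(score(h, r) for r in refs) for h in hyps) / len(hyps)


def refs_covered(hyps, refs, score=token_overlap):
    """Number of distinct references that are the best match of some hypothesis."""
    best = {max(range(len(refs)), key=lambda i: score(h, refs[i])) for h in hyps}
    return len(best)


# Toy usage: both hypotheses cluster around the first reference.
refs = ["a b c", "x y z"]
hyps = ["a b c", "a b d"]
summary = (
    oracle_reference(hyps[0], refs),
    average_oracle(hyps, refs),
    refs_covered(hyps, refs),
)
```

In the toy usage both hypotheses match the first reference best, which is the beam-like pattern of high scores with few references covered.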
Personal Thoughts
excellent experimental setups for each of the questions they are pursuing
good interpretation of results, and the paper even proposes a solution for copy noise
I hope I can learn to design experiments and interpret results this well!
Link : https://arxiv.org/pdf/1803.00047.pdf
Authors : Ott et al. 2018