Analyzing Uncertainty in Neural Machine Translation

Abstract

Technical report in which the FAIR team analyzes three classic problems in NMT:
- performance degradation with large beams
- under-estimation of rare words
- lack of diversity in the final translations
It relates these issues to two kinds of uncertainty in NMT:
- inherent uncertainty, due to the existence of multiple valid translations for a single source sentence
- extrinsic uncertainty, due to noisy training data

Datasets
- WMT14 EnDe : 4.5M sentences, 40k bpe
- WMT17 EnDe : 5.9M sentences
- WMT14 EnFr : 35.5M sentences, 40k bpe
Performance: judging by human evaluation, the EnFr model can be considered well trained

Figure 1 (left) shows that even after drawing 10k samples, the hypotheses cover only 24.9% of the sequence-level probability mass.
Figure 1 (center) shows that BLEU and model probability are imperfectly correlated: the average token probability of the max-BLEU sampled candidates plateaus after about 100 hypotheses.
Figure 1 (right) again shows that BLEU and model probability are imperfectly correlated.
Figure 2 shows that the tokens chosen by beam search have much more confident token probabilities than those of the reference or of sampled hypotheses, i.e. the model has a bias.
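
As a rough illustration of how the coverage number for Figure 1 (left) could be measured, here is a minimal sketch. It assumes each sample comes with its sequence-level log-probability under the model; the function name `covered_mass` and the toy draws are hypothetical and not taken from the paper.

```python
import math

def covered_mass(sample_logprobs):
    """Cumulative probability mass covered by the distinct hypotheses seen so far.

    sample_logprobs: iterable of (hypothesis, sequence log-probability) pairs,
    in the order they were drawn. Returns one value per draw.
    """
    seen = set()
    total = 0.0
    curve = []
    for hyp, logp in sample_logprobs:
        # A repeated hypothesis adds no new mass, so only count unseen ones.
        if hyp not in seen:
            seen.add(hyp)
            total += math.exp(logp)
        curve.append(total)
    return curve

# Toy draws with made-up probabilities (illustration only).
draws = [
    ("a b", math.log(0.10)),
    ("a c", math.log(0.05)),
    ("a b", math.log(0.10)),  # duplicate of the first draw
    ("b c", math.log(0.02)),
]
curve = covered_mass(draws)
```

The last entry of `curve` is the fraction of probability mass covered by all distinct samples, which is the quantity the notes report as 24.9% after 10k draws.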

Copy noise, i.e. target sentences in the training corpus that are exact copies of the source, is an example of data noise where even a small amount can degrade beam search at large beam sizes.
Although a copy hypothesis has low token probabilities in the initial steps of beam search, once the model has started copying its confidence becomes very high (there is no real alternative to continuing the copy), so its cumulative log-probability ends up high relative to other hypotheses by the end of the search.
When copy noise is filtered out, beam search performs comparatively well at large beam sizes.
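
A minimal sketch of what filtering copy noise could look like, assuming the simplest criterion from the notes above (the target is an exact copy of the source, ignoring whitespace differences). The paper's actual detection rule may differ; `filter_copies` is a hypothetical helper.

```python
def filter_copies(pairs):
    """Drop (source, target) pairs whose target is an exact copy of the source.

    Comparison is done on whitespace-split tokens, so differences in spacing
    alone do not hide a copy. This is an assumed criterion for illustration.
    """
    kept = []
    for src, tgt in pairs:
        if src.split() == tgt.split():
            continue  # copy noise: skip this training pair
        kept.append((src, tgt))
    return kept

# Toy corpus (made-up sentences).
pairs = [
    ("the house is small", "das Haus ist klein"),
    ("New York Times", "New York  Times"),  # copy, differs only in spacing
    ("good morning", "guten Morgen"),
]
clean = filter_copies(pairs)
```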

Beam search and sampling (the model) under-represent rare words and over-represent frequent words.
To test this, a word w is replaced with w1 or w2 with probability p(w1 | w), and the model is trained on the modified data. The trained model is then analyzed to see whether it recovers the replacement rate that determines the frequencies of w1 and w2.
- sampling estimates the rate well, so the frequency statistics are captured by the model itself
- beam search over-estimates the frequent alternative and under-estimates the rare one; the beam prefers common alternatives
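
The replacement experiment above can be sketched as follows. This is only an illustration under assumptions: the helper names, the toy corpus, and the rate 0.9 are made up, and `estimated_rate` is applied here to the modified data itself, whereas in the experiment it would be applied to model outputs from sampling or beam search.

```python
import random

def replace_word(tokens, word, w1, w2, p_w1, rng):
    """Replace each occurrence of `word` with w1 (probability p_w1) or w2 (otherwise)."""
    out = []
    for tok in tokens:
        if tok == word:
            out.append(w1 if rng.random() < p_w1 else w2)
        else:
            out.append(tok)
    return out

def estimated_rate(corpus, w1, w2):
    """Fraction of w1 among all occurrences of w1 and w2 in a tokenized corpus."""
    n1 = sum(sent.count(w1) for sent in corpus)
    n2 = sum(sent.count(w2) for sent in corpus)
    return n1 / (n1 + n2) if (n1 + n2) else float("nan")

rng = random.Random(0)  # fixed seed so the sketch is reproducible
corpus = [["the", "cat", "sat"]] * 10000
modified = [replace_word(s, "cat", "cat_1", "cat_2", 0.9, rng) for s in corpus]
rate = estimated_rate(modified, "cat_1", "cat_2")
```

Comparing `estimated_rate` on sampled outputs and on beam outputs against the true rate is one way to see whether a decoding strategy preserves or distorts the frequencies of w1 and w2.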

Setting
- 10 distinct human translators produced 10 reference translations for each of 500 test sentences
- oracle reference: the best BLEU between the top hypothesis and the 10 human translations
- average oracle: the average BLEU between the k hypotheses and the 10 human translations
Interpretation
- beam k=5 has a high oracle reference and a high average oracle, meaning most hypotheses are accurate and close to human translations; but the number of references covered is low, meaning diversity is low
- beam k=200 shows a similar tendency, with lower BLEU and a slight rise in diversity
- sampling k=200 shows the opposite: the average oracle is low, meaning several hypotheses match the 10 references poorly, but the number of references covered is high, meaning coverage and diversity are high
Hence, the model distribution is too concentrated in hypothesis space and does not cover diverse translations.
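
The three quantities discussed in the Setting and Interpretation notes can be sketched as below. This is one possible reading, not the paper's exact definitions: unigram F1 is used as a simple stand-in for sentence-level BLEU, each hypothesis is scored against its best-matching reference, and a reference counts as "covered" when it is the best match for at least one hypothesis. All function names are hypothetical.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Stand-in similarity (NOT BLEU): F1 over unigram counts."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def oracle_reference(top_hyp, refs, sim=unigram_f1):
    """Best score of the top hypothesis against any reference."""
    return max(sim(top_hyp, ref) for ref in refs)

def average_oracle(hyps, refs, sim=unigram_f1):
    """Mean over hypotheses of each hypothesis's best score against the references."""
    return sum(oracle_reference(h, refs, sim) for h in hyps) / len(hyps)

def refs_covered(hyps, refs, sim=unigram_f1):
    """Number of distinct references that are the best match of some hypothesis."""
    covered = set()
    for h in hyps:
        best = max(range(len(refs)), key=lambda i: sim(h, refs[i]))
        covered.add(best)
    return len(covered)

# Toy example (made-up sentences).
refs = ["the cat sat", "a dog ran"]
beam_like = ["the cat sat", "the cat sits"]     # accurate but similar to each other
sample_like = ["the cat sat", "a dog ran off"]  # more varied
```

On the toy data, `beam_like` matches only one reference while `sample_like` reaches both, which mirrors the accuracy versus diversity contrast described in the interpretation above.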