Abstract
Use the attention distribution of the NMT model to estimate translation quality
Add attention-filtered synthetic data to the existing parallel corpus to improve NMT translation quality, measured in BLEU
Details
Attention-based Metrics
Coverage Deviation Penalty
aims to penalize input tokens whose total received attention deviates too far from 1.0, i.e., tokens that are over- or under-attended during translation
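A minimal NumPy sketch of the idea, assuming an (I, J) attention matrix with one row per output token; the log-squared-deviation form follows the paper, but the function name and the final exponentiation into a (0, 1] score are my own framing:

```python
import numpy as np

def coverage_deviation_penalty(attn: np.ndarray) -> float:
    """CDP over an (I, J) attention matrix: I output tokens (rows,
    each a distribution over inputs), J input tokens (columns).

    For each input token, sum the attention it received across all
    output steps and penalize the squared deviation of that sum from 1.
    """
    J = attn.shape[1]
    coverage = attn.sum(axis=0)  # total attention received per input token
    log_penalty = -np.log(1.0 + (1.0 - coverage) ** 2).sum() / J
    return float(np.exp(log_penalty))  # 1.0 = perfect coverage, -> 0.0 otherwise
```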
Absentmindedness Penalty
measures the dispersion of attention via the entropy of the predicted attention distribution. As with the CDP, we want the penalty value to be 1.0 for the lowest entropy (sharply focused attention) and to head towards 0.0 for higher entropies
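A matching sketch under the same (I, J) matrix assumptions; as I understand the paper, this penalty is applied in both directions, over the output-token rows and, after transposing and renormalizing, over the input-token columns:

```python
import numpy as np

def absentmindedness_penalty(attn: np.ndarray, eps: float = 1e-12) -> float:
    """AP over an (I, J) attention matrix: the mean negative entropy
    of the per-output-token attention rows.

    A one-hot row has zero entropy, so the score is exp(0) = 1.0;
    dispersed rows have high entropy and push the score towards 0.0.
    """
    I = attn.shape[0]
    neg_entropy = (attn * np.log(attn + eps)).sum() / I  # always <= 0
    return float(np.exp(neg_entropy))
```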
Training NMT with additional data
back-translation alone already improves quality
but filtering the synthetic data with the attention-based confidence score improves it further (see the sketch after this list)
it helps especially in the morphologically rich -> morphologically simpler language direction
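Putting the two penalties together, a hypothetical filtering pass over back-translated data, reusing the CDP/AP helpers sketched above; the product combination and the 0.3 threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def confidence(attn: np.ndarray) -> float:
    # Combine the penalties sketched above into one (0, 1] score.
    # Multiplying the exponentiated penalties equals summing them in
    # log space; the paper's exact combination may differ.
    attn_in = attn.T / (attn.T.sum(axis=1, keepdims=True) + 1e-12)
    return (coverage_deviation_penalty(attn)
            * absentmindedness_penalty(attn)      # AP over output tokens
            * absentmindedness_penalty(attn_in))  # AP over input tokens

def filter_synthetic(pairs, attn_matrices, threshold=0.3):
    """Keep only back-translated sentence pairs whose attention-based
    confidence clears the threshold (0.3 is a made-up example value)."""
    return [pair for pair, attn in zip(pairs, attn_matrices)
            if confidence(attn) >= threshold]
```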
Personal Thoughts
Enriching and smoothing the training data via back-translation, a copied corpus, sequence-level knowledge distillation, and an attention-filtered synthetic corpus is a very strong technique
Link: https://arxiv.org/pdf/1710.03743.pdf
Authors: Rikters et al., 2017