Proposes an architectural design that shortens the distance between source and target words in the NMT task
Analyzes how alignment and over-/under-translation improve as a result
Details
Introduction
In NMT, source words are encoded first; a weighted average (via attention) of the encoded source representations is fed into the decoder, together with the previously decoded target token, to generate the next target token
Authors point out that shortening the distance between target tokens and source tokens leads to a better model
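The attention-and-decode step described above can be sketched as follows. This is a minimal numpy sketch; the single projection matrix `W_att` and the concatenation at the end are simplifying assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(enc_states, prev_target_emb, dec_state, W_att):
    """One attention + input-assembly step of an attentional decoder.

    enc_states: (src_len, hidden)  encoded source representations
    prev_target_emb: (emb,)        embedding of the previously decoded token
    dec_state: (hidden,)           current decoder hidden state
    W_att: (hidden, hidden)        hypothetical attention projection
    """
    # score each source position against the decoder state
    scores = enc_states @ (W_att @ dec_state)      # (src_len,)
    weights = softmax(scores)                      # attention distribution
    context = weights @ enc_states                 # weighted average of source states
    # context + previous target embedding drive the next-token prediction
    new_input = np.concatenate([context, prev_target_emb])
    return new_input, weights
```
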
Three Strategies
Source-side bridging
concatenate the source word embeddings onto the final encoder representations
Target-side bridging
feed a single (attention-selected) source embedding as a direct input to the decoder
Direct bridging
add a training loss that minimizes the distance between each target word embedding and the embedding of its most relevant source word, selected via attention
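The direct-bridging loss might look roughly like the sketch below. This is a hypothetical simplification: it assumes source and target embeddings share a dimension (the paper may learn a transformation between the two spaces) and uses a hard argmax over attention weights to pick the most relevant source word:

```python
import numpy as np

def direct_bridging_loss(tgt_embs, src_embs, attn):
    """Auxiliary loss pulling each target embedding toward its
    most-attended source embedding.

    tgt_embs: (tgt_len, dim)  target word embeddings
    src_embs: (src_len, dim)  source word embeddings
    attn:     (tgt_len, src_len) attention weights
    """
    best = attn.argmax(axis=1)          # most relevant source position per target word
    selected = src_embs[best]           # (tgt_len, dim)
    diff = tgt_embs - selected
    return 0.5 * np.sum(diff ** 2)      # squared-distance penalty
```

At training time this term would be added (with some weight) to the usual cross-entropy objective, so the gradient also shortens the embedding-space distance between aligned words.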
Results
NIST Zh-En dataset (1.25M sentence pairs)
Bi-Directional GRU with Attention
All bridging variants improve the BLEU score, but SMT is very competitive with RNNSearch on this dataset
Analysis
Alignment
percentage of target EOS tokens attending to the source EOS token is much higher with direct bridging
POS-based alignment improves slightly
Alignment Error Rate (AER) improves
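AER here is presumably the standard Och & Ney metric over sure (S) and possible (P) gold alignment links, AER = 1 − (|A∩S| + |A∩P|) / (|A| + |S|); a minimal implementation:

```python
def aer(sure, possible, hyp):
    """Alignment Error Rate (lower is better).

    sure, possible, hyp: sets of (src_idx, tgt_idx) alignment links,
    with sure being a subset of possible.
    """
    a_s = len(hyp & sure)        # hypothesis links that are sure
    a_p = len(hyp & possible)    # hypothesis links that are at least possible
    return 1.0 - (a_s + a_p) / (len(hyp) + len(sure))
```
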
Over-translation
ROT (rate of over-translation), a metric proposed by [Li et al 2017](), is computed as the number of aligned source positions (counted with repetition) divided by the number of words in the source word set
ROT decreased by 15% with the direct bridging model
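Under that reading (aligned source positions counted with repetition, divided by the size of the source word set), ROT could be computed roughly as below. This is a hedged sketch of the note's description, not necessarily Li et al.'s exact definition:

```python
def rot(aligned_src_positions):
    """Rate of over-translation (sketch).

    aligned_src_positions: list of source word indices, one entry per
    time a source word is translated (so repeats indicate over-translation).
    A value of 1.0 means every translated source word was translated once.
    """
    return len(aligned_src_positions) / len(set(aligned_src_positions))
```
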
Under-translation
Uses 1-gram BLEU as a proxy for under-translation; 1-gram BLEU improves with direct bridging but remains lower than SMT's
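1-gram BLEU is essentially clipped unigram precision (ignoring BLEU's brevity penalty), which is why it serves as a proxy for how much of the source content made it into the translation; a minimal sketch:

```python
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision (brevity penalty omitted).

    candidate, reference: lists of tokens.
    """
    cand, ref = Counter(candidate), Counter(reference)
    # clip each candidate count by its count in the reference
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(1, sum(cand.values()))
```
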
Personal Thoughts
The idea of reducing the maximum path length between target and source tokens seems valid
The in-depth analysis of alignment and over-/under-translation was impressive
Over-/under-translation is notoriously hard to tackle
Link: https://arxiv.org/pdf/1711.05380v4.pdf
Authors: Kuang et al. 2018