Proposes a novel model for fast sequence generation: the Semi-Autoregressive Transformer (SAT)
Produces multiple successive words in parallel at each decoding step (K = 2, 4, 6, etc.)
Achieves a good balance between translation quality and decoding speed on WMT14 En-De and NIST Zh-En
Maximum speed-up of 5.58x while retaining 88% of translation quality on En-De
With K=2, SAT is almost lossless (only ~1% loss in BLEU)
Details
Introduction
Sequence generation tasks suffer from their autoregressive nature: the output must be decoded one token at a time
Although CNN and self-attention modules enable parallel processing on the source/encoder side, the target/decoder side remains autoregressive at inference time
Recent Works
Gu et al. 2017 proposed a fully non-autoregressive NMT model that uses fertilities to predict the target length; it gains significant speed but degrades translation quality too much
Lee et al. 2018 proposed a non-autoregressive sequence model with iterative refinement, but quality still suffers
Kaiser et al. 2018 proposed a semi-autoregressive model in which a Transformer first autoencodes the sentence into a shorter sequence of discrete latent variables, from which the target sentence is then generated in parallel
Semi-Autoregressive Transformer
Group-level Chain-Rule
The chain rule is applied at the level of token groups of size K: groups are generated autoregressively left to right, while the K tokens within a group are produced in parallel (factorization below)
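In symbols (my transcription of the paper's factorization, for target length m):

```latex
% Group-level chain rule: groups autoregressive, tokens within a group parallel
p(y_1,\dots,y_m \mid x)
  = \prod_{t=1}^{\lceil m/K \rceil} p\big(G_t \mid G_1,\dots,G_{t-1},\, x\big),
\qquad G_t = \big(y_{(t-1)K+1},\dots,y_{tK}\big)
```

With K=1 this reduces to the standard token-level chain rule (the vanilla Transformer); as K approaches the target length it becomes fully non-autoregressive.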
Long-Distance Prediction
The model predicts K steps ahead: the token at position t+K is conditioned only on tokens up to position t (alignment sketch below)
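A minimal sketch (my own illustration, not the released code) of how decoder inputs line up with targets under long-distance prediction: the decoder input is the target shifted K positions to the right, padded with K start symbols, so input position t predicts target token t+K:

```python
# Illustration of long-distance prediction for group size K:
# the decoder input is the target shifted K positions to the right,
# so the model at input position t predicts target token t+K.
K = 2
target = ["y1", "y2", "y3", "y4", "y5", "y6"]

# K start symbols, then the target minus its last K tokens
decoder_input = ["<s>"] * K + target[:-K]

for inp, out in zip(decoder_input, target):
    print(inp, "->", out)
# <s> -> y1
# <s> -> y2
# y1 -> y3
# y2 -> y4
# y3 -> y5
# y4 -> y6
```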
Relaxed Causal Mask
At training time the masking strategy differs from the standard Transformer: the lower-triangular causal mask is relaxed so that positions within the same K-token group can see each other (sketch below)
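A minimal NumPy sketch (the exact indexing is my assumption, not the authors' code): position i may attend to every position up to the end of its own K-token group:

```python
import numpy as np

def relaxed_causal_mask(n: int, K: int) -> np.ndarray:
    """mask[i, j] = 1 iff query position i may attend to key position j."""
    i = np.arange(n)[:, None]          # query positions (0-indexed)
    j = np.arange(n)[None, :]          # key positions
    visible_up_to = (i // K + 1) * K   # end of the group containing i
    return (j < visible_up_to).astype(np.int32)

print(relaxed_causal_mask(6, K=2))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Setting K=1 recovers the standard causal mask, which again shows the Transformer is the K=1 special case of SAT.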
Complexity and Acceleration (a = time spent in the decoder network per iteration, b = time spent on beam search)
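A rough accounting in the spirit of this section (my reconstruction; the paper's exact bookkeeping may differ): the decoder network runs once per group rather than once per token, while beam search still touches every token, so for an n-token target:

```latex
T_{\text{Transformer}} \approx n\,(a + b), \qquad
T_{\text{SAT}} \approx n\left(\tfrac{a}{K} + b\right), \qquad
\text{speedup} \approx \frac{a+b}{a/K + b} \;<\; K
```

This would explain why the measured speed-ups (e.g., 1.51x at K=2) fall short of the ideal factor K.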
Train
Trained with knowledge distillation (a teacher-student setup) for better performance (sketch below)
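A minimal sketch of sequence-level knowledge distillation as commonly used for non-autoregressive NMT (hypothetical translate/train_step helpers; not the released sa-nmt code): the SAT student is trained on translations produced by a pre-trained autoregressive teacher instead of the original references:

```python
def distill_corpus(teacher, source_sentences):
    """Replace the reference targets with the teacher's own translations."""
    return [(src, teacher.translate(src)) for src in source_sentences]

def train_student(student, teacher, source_sentences, epochs=10):
    distilled = distill_corpus(teacher, source_sentences)
    for _ in range(epochs):
        for src, tgt in distilled:
            student.train_step(src, tgt)  # standard cross-entropy update
```

The teacher's outputs are less noisy and more monotone than the references, which makes the parallel prediction problem easier for the student.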
Result
WMT14 EnDe
With K=2, BLEU is 26.90 (vs. 27.11 for the state-of-the-art baseline), with a 1.51x speed-up
Offers a better speed-quality trade-off than other non-autoregressive methods
NIST02 Zh-En
With K=2, BLEU is 39.57 (vs. 40.59 for the state-of-the-art baseline), with a 1.69x speed-up
Case Study
Position-wise cross-entropy is higher at later positions within each group, indicating that long-distance prediction is consistently more difficult
Observes a frequent repetition issue in the generated translations
Future Work
Design a better loss function or model for long-distance prediction
Explore more stable training methods to combine with knowledge distillation
Let the network adaptively determine the group size K
Personal Thoughts
Nice implementation. The idea itself is not super-creative: NAT was already out, and a semi-autoregressive middle ground is a natural next step
Link: https://arxiv.org/pdf/1808.08583v2.pdf Code: https://github.com/chqiwang/sa-nmt Authors: Wang et al. 2018