Proposes a novel model for fast sequence generation: the Semi-Autoregressive Transformer (SAT)
Produces multiple successive words in parallel at each decoding step (K = 2, 4, 6, etc.)
Achieves a good balance between translation quality and decoding speed on WMT14 En-De and NIST Zh-En
Maximum speed-up of 5.58x while retaining 88% of translation quality on En-De
With K=2, SAT is almost lossless (only ~1% loss in BLEU)
Details
Introduction
Sequence generation tasks suffer from their autoregressive nature: the output must be decoded one token at a time
Although CNN and self-attention modules enable parallel processing on the source/encoder side, the target/decoder side remains autoregressive at inference time
Recent Works
Gu et al. 2017 proposed a fully non-autoregressive NMT model that uses fertilities to predict the target length; it gains significant speed but degrades translation quality too much
Lee et al. 2018 proposed a non-autoregressive sequence model with iterative refinement, but quality still suffers
Kaiser et al. 2018 proposed a semi-autoregressive model in which a Transformer first autoencodes the sentence into a shorter sequence of discrete latent variables, from which the target sentence is then generated in parallel
Semi-Autoregressive Transformer
Group-level Chain-Rule
The chain rule is applied at the level of token groups of size K: groups are generated autoregressively left to right, while the K tokens within a group are produced in parallel (factorization below)
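In symbols (my transcription of the paper's factorization, for target length m):

```latex
% Group-level chain rule: groups autoregressive, tokens within a group parallel
p(y_1,\dots,y_m \mid x)
  = \prod_{t=1}^{\lceil m/K \rceil} p\big(G_t \mid G_1,\dots,G_{t-1},\, x\big),
\qquad G_t = \big(y_{(t-1)K+1},\dots,y_{tK}\big)
```

With K=1 this reduces to the standard token-level chain rule (the vanilla Transformer); as K approaches the target length it becomes fully non-autoregressive.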
Long-Distance Prediction
The model predicts K steps ahead: the token at position t+K is conditioned only on tokens up to position t (alignment sketch below)
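A minimal sketch (my own illustration, not the released code) of how decoder inputs line up with targets under long-distance prediction: the decoder input is the target shifted K positions to the right, padded with K start symbols, so input position t predicts target token t+K:

```python
# Illustration of long-distance prediction for group size K:
# the decoder input is the target shifted K positions to the right,
# so the model at input position t predicts target token t+K.
K = 2
target = ["y1", "y2", "y3", "y4", "y5", "y6"]

# K start symbols, then the target minus its last K tokens
decoder_input = ["<s>"] * K + target[:-K]

for inp, out in zip(decoder_input, target):
    print(inp, "->", out)
# <s> -> y1
# <s> -> y2
# y1 -> y3
# y2 -> y4
# y3 -> y5
# y4 -> y6
```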
Relaxed Causal Mask
At training time the masking strategy differs from the standard Transformer: the lower-triangular causal mask is relaxed so that positions within the same K-token group can see each other (sketch below)
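A minimal NumPy sketch (the exact indexing is my assumption, not the authors' code): position i may attend to every position up to the end of its own K-token group:

```python
import numpy as np

def relaxed_causal_mask(n: int, K: int) -> np.ndarray:
    """mask[i, j] = 1 iff query position i may attend to key position j."""
    i = np.arange(n)[:, None]          # query positions (0-indexed)
    j = np.arange(n)[None, :]          # key positions
    visible_up_to = (i // K + 1) * K   # end of the group containing i
    return (j < visible_up_to).astype(np.int32)

print(relaxed_causal_mask(6, K=2))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Setting K=1 recovers the standard causal mask, which again shows the Transformer is the K=1 special case of SAT.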
Complexity and Acceleration (a = time spent in the decoder network per iteration, b = time spent on beam search)
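A rough accounting in the spirit of this section (my reconstruction; the paper's exact bookkeeping may differ): the decoder network runs once per group rather than once per token, while beam search still touches every token, so for an n-token target:

```latex
T_{\text{Transformer}} \approx n\,(a + b), \qquad
T_{\text{SAT}} \approx n\left(\tfrac{a}{K} + b\right), \qquad
\text{speedup} \approx \frac{a+b}{a/K + b} \;<\; K
```

This would explain why the measured speed-ups (e.g., 1.51x at K=2) fall short of the ideal factor K.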
Train
Trained with knowledge distillation (a teacher-student setup) for better performance (sketch below)
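A minimal sketch of sequence-level knowledge distillation as commonly used for non-autoregressive NMT (hypothetical translate/train_step helpers; not the released sa-nmt code): the SAT student is trained on translations produced by a pre-trained autoregressive teacher instead of the original references:

```python
def distill_corpus(teacher, source_sentences):
    """Replace the reference targets with the teacher's own translations."""
    return [(src, teacher.translate(src)) for src in source_sentences]

def train_student(student, teacher, source_sentences, epochs=10):
    distilled = distill_corpus(teacher, source_sentences)
    for _ in range(epochs):
        for src, tgt in distilled:
            student.train_step(src, tgt)  # standard cross-entropy update
```

The teacher's outputs are less noisy and more monotone than the references, which makes the parallel prediction problem easier for the student.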
Result
WMT14 EnDe
With K=2, BLEU is 26.90 (vs. 27.11 for the state-of-the-art baseline), with a 1.51x speed-up
Offers a better speed-quality trade-off than other non-autoregressive methods
NIST02 Zh-En
With K=2, BLEU is 39.57 (vs. 40.59 for the state-of-the-art baseline), with a 1.69x speed-up
Case Study
Position-wise cross-entropy is higher at later positions within each group, indicating that long-distance prediction is consistently more difficult
Observes a frequent repetition issue in the generated translations
Future Work
Design a better loss function or model for long-distance prediction
Explore more stable training methods to combine with knowledge distillation
Let the network adaptively determine the group size K
Personal Thoughts
Nice implementation. The idea itself is not super-creative: NAT was already out, and a semi-autoregressive middle ground is a natural next step
Link: https://arxiv.org/pdf/1808.08583v2.pdf Code: https://github.com/chqiwang/sa-nmt Authors: Wang et al. 2018