Abstract
propose a novel blockwise parallel decoding scheme that makes predictions for multiple time steps in parallel and then backs off to the longest prefix validated by a scoring model
apply it to existing SoTA models for machine translation and image super-resolution
achieves a 2x speedup without loss in quality (MT)
achieves a 3.3x speedup with a slight loss in quality (MT)
Details
Introduction
Non-Autoregressive Decoding
Problem : although encoding the source sentence can be parallelized via self-attention, decoding the target sentence is still autoregressive and hence slow and inefficient
Fully non-autoregressive models (Gu et al., 2017) are difficult to train and lead to a large loss in quality
Discrete latent variable models (Kaiser et al., 2018) do not reach SoTA quality
Iterative refinement (Lee et al., 2018) shows impressive results, but the speedup is not significant
Blockwise Parallel Decoding
restricted to Greedy Decoding
Algorithm
Predict : predict the next k tokens of the block in parallel, one per output head
Verify : find the largest k~ that is quality-equivalent to greedy decoding
run the base model in parallel on each of the k prefixes formed by the proposed tokens, using the proposals as oracle inputs (these calls can be reused as the next predict step)
check the validity of the k proposed tokens against the base model's greedy predictions and accept the longest matching prefix k~
Accept : extend the result by the first k~ verified tokens (see the sketch after this list)
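A minimal sketch of the predict/verify/accept loop. The functions propose() (standing in for the k extra output heads) and base_predict() (standing in for the original greedy model) are toy placeholders, not the paper's implementation, and the verification calls that would run inside one batched decoder pass are written as a plain loop for clarity.

```python
def propose(prefix, k):
    # Toy proposal model: emits k increasing integers as "tokens".
    return [len(prefix) + i for i in range(k)]

def base_predict(prefix):
    # Toy base model: greedy next token given a prefix.
    return len(prefix)

def blockwise_parallel_decode(k, target_len):
    y = []
    while len(y) < target_len:
        # Predict: propose the next k tokens in one shot.
        block = propose(y, k)

        # Verify: keep the longest prefix of the block whose tokens match what
        # the base greedy decoder would have produced one step at a time.
        k_hat = 0
        for i, tok in enumerate(block):
            if tok == base_predict(y + block[:i]):
                k_hat += 1
            else:
                break
        # In the paper the first head is the base model itself, so at least one
        # token is always accepted; this guard plays that role in the toy setup.
        k_hat = max(k_hat, 1)

        # Accept: extend the output by the verified prefix.
        y.extend(block[:k_hat])
    return y[:target_len]

print(blockwise_parallel_decode(k=4, target_len=10))
```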
Approximate Inference
Top-k selection : relax the acceptance condition by allowing a proposed token to match any of the base model's top-k candidates instead of only the argmax
Distance-based selection : for images, a distance metric d (with a tolerance) can be used as the acceptance criterion; both relaxations are sketched below
Minimum Block Size : to guarantee a minimum speedup, we can require at least l tokens to be accepted per step. The ablation study shows this hurts quality. (min_block_size=1 is best)
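A small sketch of the relaxed verification criteria. The names topk_candidates, d, and epsilon are illustrative placeholders under the assumption that the base model exposes its top-scoring candidates and, for images, a distance metric with a tolerance; they are not from the paper.

```python
def verify_exact(proposed, greedy):
    # Exact scheme: accept only if the proposal equals the greedy token.
    return proposed == greedy

def verify_topk(proposed, topk_candidates):
    # Top-k selection: accept if the proposal is among the base model's
    # k highest-scoring candidates instead of only the single argmax.
    return proposed in topk_candidates

def verify_distance(proposed, greedy, d, epsilon):
    # Distance-based selection (e.g. image intensities): accept if the
    # proposal is within epsilon of the greedy value under metric d.
    return d(proposed, greedy) <= epsilon

# Toy usage:
print(verify_topk(7, topk_candidates=[3, 7, 9]))                        # True
print(verify_distance(130, 128, d=lambda a, b: abs(a - b), epsilon=2))  # True
```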
Training
pre-train a Transformer base model on WMT14 EnDe for 100k steps
modify the decoder by extending it to k output layers and fine-tune for 100k steps
due to memory constraints, the mean of the k cross-entropy losses cannot be used, so one of the k sub-losses is sampled uniformly at random as an unbiased estimate of the full loss (see the sketch after this list)
Knowledge distillation for smoother training
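A sketch of the memory-saving loss: instead of averaging all k head losses, one head is sampled uniformly per step, which has the same expectation as the full mean. head_loss below is a hypothetical placeholder for the cross-entropy of a single output head.

```python
import random

def head_loss(j, batch):
    # Placeholder for the cross-entropy loss of output head j on a batch.
    return float(j)  # toy value so the example runs

def sampled_loss(k, batch):
    # Sampling j uniformly gives E[head_loss(j)] = (1/k) * sum_j head_loss(j),
    # i.e. an unbiased estimate of the mean over all k heads, while only one
    # head's loss (and its activations) needs to be kept in memory.
    j = random.randrange(k)
    return head_loss(j, batch)

print(sampled_loss(k=4, batch=None))
```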
Machine Translation (Experiments)
Methods
Regular : freeze the pre-trained model and train only the k new output layers
Distillation : freeze the pre-trained model and train the k output layers on distilled data
Fine-Tuning : fine-tune the pre-trained model together with the k output layers
Both : fine-tune the pre-trained model together with the k output layers on distilled data
Result
Combining distillation and fine-tuning leads to a significant speed improvement while maintaining quality
Wall-Clock Speedup
the mean accepted block size is a proxy for the speedup; an actual wall-clock speedup of about 3x is obtained for MT
Example
Generation Process : in the paper's example, step 1 outputs 10 tokens simultaneously
Overall Performance
3x speedup with only a 1 BLEU point loss (k=4 or k=6 seems practical)
Personal Thoughts
wow, great paper with a simple yet effective idea!
Link : https://arxiv.org/pdf/1811.03115.pdf / Authors : Stern et al., 2018