introduces a framework for fast decoding in autoregressive models via discrete latent variables
proposes an improved discretization technique: Decomposed Vector Quantization (DVQ)
application to neural machine translation, the Latent Transformer, shows good results with faster decoding time
Discretization Techniques
Gumbel-Softmax
adds Gumbel noise to the log-softmax of the logits and applies a temperature-scaled softmax (eq. 2), which makes sampling from the discrete distribution differentiable
the temperature can be adjusted to bias the model toward a discrete (near one-hot) representation
at inference, a low temperature is used
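A minimal sketch of Gumbel-Softmax sampling with a straight-through option; the temperature values and tensor shapes are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax(logits, temperature=1.0, hard=False):
    """Draw an (approximately one-hot) differentiable sample from `logits`."""
    # Sample Gumbel(0, 1) noise and add it to the logits.
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y = F.softmax((logits + gumbel_noise) / temperature, dim=-1)
    if hard:
        # Straight-through: one-hot forward pass, soft gradients backward.
        index = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, index, 1.0)
        y = (y_hard - y).detach() + y
    return y

# Moderate temperature during training keeps gradients informative;
# a low temperature (or hard=True) at inference yields near-discrete codes.
codes = gumbel_softmax(torch.randn(4, 512), temperature=0.5, hard=True)
```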
Improved Semantic Hashing
uses a saturating sigmoid (eq. 3) on the noised encoder output (eq. 4) together with a binary discretization bottleneck (eq. 5) to generate the discrete representation
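A hedged sketch of the improved semantic hashing bottleneck; the noise scale and the 50/50 discrete/continuous mixing are common choices and may not match the paper's exact formulation.

```python
import torch

def saturating_sigmoid(x):
    # eq. 3: a rescaled sigmoid that actually reaches 0 and 1.
    return torch.clamp(1.2 * torch.sigmoid(x) - 0.1, min=0.0, max=1.0)

def semhash_bottleneck(enc_y, training=True):
    """enc_y: (batch, d) encoder output."""
    # eq. 4: add Gaussian noise to the encoder output during training.
    noise = torch.randn_like(enc_y) if training else torch.zeros_like(enc_y)
    v = saturating_sigmoid(enc_y + noise)
    # eq. 5: binary discretization with a straight-through gradient.
    v_bin = v + ((enc_y > 0).float() - v).detach()
    if training:
        # Pass the continuous version for a random half of the batch so that
        # gradients keep flowing through the saturating sigmoid.
        mask = (torch.rand(enc_y.size(0), 1) < 0.5).float()
        return mask * v + (1.0 - mask) * v_bin
    return v_bin
```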
Vector Quantization
the embedding is found by nearest-neighbour lookup in the codebook (eq. 6)
the VAE is learnt via reconstruction (eq. 7) and the embeddings are learnt via dictionary learning (VQ loss, eq. 8)
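A minimal sketch of the VQ bottleneck in PyTorch; the commitment weight `beta` and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vq_bottleneck(enc_y, codebook, beta=0.25):
    """enc_y: (batch, d), codebook: (K, d). Returns quantized codes, indices, VQ loss."""
    # eq. 6: nearest-neighbour lookup in the codebook.
    distances = torch.cdist(enc_y, codebook)          # (batch, K)
    indices = distances.argmin(dim=-1)                # discrete latent codes
    quantized = codebook[indices]                     # (batch, d)
    # eq. 8: codebook (dictionary) loss plus commitment loss.
    codebook_loss = F.mse_loss(quantized, enc_y.detach())
    commitment_loss = beta * F.mse_loss(enc_y, quantized.detach())
    # Straight-through estimator so the encoder receives gradients
    # from the reconstruction loss (eq. 7).
    quantized = enc_y + (quantized - enc_y).detach()
    return quantized, indices, codebook_loss + commitment_loss
```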
Decomposed Vector Quantization
Motivation
VQ-VAE suffers from index collapse, where only a few indices are preferred by the model throughout training, as seen in Figure 4
Sliced Vector Quantization
break up the encoder output enc(y) into n_d smaller slices that are quantized separately, similar to multi-head attention in the Transformer (eq. 9)
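A hedged sketch of sliced DVQ that reuses the `vq_bottleneck` sketch above: split enc(y) into n_d slices, quantize each slice against its own codebook, and concatenate the results. The separate-codebook-per-slice choice is an assumption for illustration.

```python
import torch

def sliced_dvq(enc_y, codebooks):
    """enc_y: (batch, d); codebooks: list of n_d tensors, each (K, d // n_d)."""
    n_d = len(codebooks)
    slices = enc_y.chunk(n_d, dim=-1)                 # n_d slices of size d / n_d
    quantized, indices, loss = [], [], 0.0
    for s, cb in zip(slices, codebooks):
        q, idx, l = vq_bottleneck(s, cb)
        quantized.append(q)
        indices.append(idx)
        loss = loss + l
    # The composed discrete code is the tuple of n_d slice indices.
    return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1), loss
```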
Latent Transformer
Main Steps
the VAE encoder encodes the target sentence y into a shorter sequence of discrete latent variables l (parallel)
a latent prediction model, a Transformer, is trained to predict l from the source sentence x (autoregressive)
the VAE decoder decodes the predicted l back into the sequence y (parallel)
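A schematic sketch of the decoding pipeline at inference time; `latent_predictor`, `ae_decoder`, and `greedy_decode` are placeholder names, not the paper's actual API. The VAE encoder is only needed during training.

```python
import torch

@torch.no_grad()
def fast_decode(x, latent_predictor, ae_decoder):
    # 1) Autoregressively predict the short latent sequence l from the source x.
    #    l is much shorter than y, so this loop is cheap.
    l = latent_predictor.greedy_decode(x)
    # 2) Decode all target tokens from l (and x) in parallel.
    y = ae_decoder(l, x)
    return y
```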
Loss Function
reconstruction loss L_r from VAE
latent prediction loss L_lp from Latent Transformer
for the first 10k steps, the true targets y are given to the transformer decoder instead of the decompressed latents l, which ensures the self-attention part has reasonable gradients to train the whole architecture
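A hedged sketch of the combined objective; `autoencoder` and `latent_predictor` are placeholder modules, the `use_true_targets` flag stands in for the warm-up trick above, and the unweighted sum of the two losses is an assumption.

```python
def training_losses(step, x, y, autoencoder, latent_predictor):
    # Discrete latents l for the target y, plus the reconstruction loss L_r.
    latents, L_r = autoencoder(y, x, use_true_targets=(step < 10_000))
    # L_lp: the Transformer learns to predict l autoregressively from x.
    L_lp = latent_predictor(x, targets=latents.detach())
    return L_r + L_lp
```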
Architectures of VAE
Encoder
conv residual blocks + attention + conv to scale down the dimension
Decoder
conv residual blocks + attention + up-conv to scale up the dimension
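A minimal sketch of the convolutional down-sampling used on the encoder side; block counts, kernel sizes, and the overall compression factor are assumptions, and the attention sub-layers are omitted. The decoder would mirror this with transposed convolutions.

```python
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual 1-D conv block followed by a strided conv that halves the length."""
    def __init__(self, d_model):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        self.downsample = nn.Conv1d(d_model, d_model, kernel_size=4, stride=2, padding=1)

    def forward(self, h):                 # h: (batch, d_model, length)
        h = h + self.residual(h)
        return self.downsample(h)         # length halved

# Stacking three such blocks compresses the target sequence by a factor of 8
# before the discrete bottleneck (the factor is an assumed example).
encoder = nn.Sequential(*[DownsampleBlock(512) for _ in range(3)])
```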
Transformer Decoder
Experiments
WMT'16 En-De with a 33k BPE vocabulary
BLEU is higher than the NAT baseline, but lower than the autoregressive (AT) baseline
Latent Transformer with all variants of discrete latent variables
VQ-VAE fails to learn
s-DVQ, p-DVQ, and SemHash train well
Improved decoding speed
Discussions
index collapse is resolved by DVQ
Percentage of latents used by DVQ with varying n_d
why is at most 74.5% of the codebook used?
why does the percentage decrease with higher n_d?
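A small sketch (not from the paper) of how the "percentage of latents used" could be measured: the fraction of distinct code indices emitted over an evaluation set, relative to the codebook size.

```python
import torch

def codebook_usage(indices, codebook_size):
    """indices: LongTensor of code ids collected over a dataset."""
    used = torch.unique(indices).numel()
    return 100.0 * used / codebook_size
```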
Personal Thoughts
Great engineering effort: improving VQ-VAE and proving the performance of a generic framework
wanted to see actual sentence examples and their relationship to the latent space
will latent-variable, non-autoregressive models ever achieve equal or better performance than their autoregressive counterparts?
Link: https://arxiv.org/pdf/1803.03382.pdf
Authors: Kaiser et al., 2018