introduces a framework for fast decoding in autoregressive models via discrete latent variables
proposes an improved discretization technique: Decomposed Vector Quantization (DVQ)
application to neural machine translation, the Latent Transformer, shows good results with faster decoding time
Discretization Techniques
Gumbel-Softmax
adds Gumbel noise to the log-softmax of the logits and applies a temperature-scaled softmax (eq. 2), which makes sampling from the discrete distribution differentiable
the temperature can be adjusted to bias the model toward a discrete (near one-hot) representation
at inference, a low temperature is used
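A minimal sketch of Gumbel-Softmax sampling with a straight-through option; the temperature values and tensor shapes are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax(logits, temperature=1.0, hard=False):
    """Draw an (approximately one-hot) differentiable sample from `logits`."""
    # Sample Gumbel(0, 1) noise and add it to the logits.
    gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    y = F.softmax((logits + gumbel_noise) / temperature, dim=-1)
    if hard:
        # Straight-through: one-hot forward pass, soft gradients backward.
        index = y.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y).scatter_(-1, index, 1.0)
        y = (y_hard - y).detach() + y
    return y

# Moderate temperature during training keeps gradients informative;
# a low temperature (or hard=True) at inference yields near-discrete codes.
codes = gumbel_softmax(torch.randn(4, 512), temperature=0.5, hard=True)
```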
Improved Semantic Hashing
uses a saturating sigmoid (eq. 3) on the noised encoder output (eq. 4) together with a binary discretization bottleneck (eq. 5) to generate the discrete representation
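A hedged sketch of the improved semantic hashing bottleneck; the noise scale and the 50/50 discrete/continuous mixing are common choices and may not match the paper's exact formulation.

```python
import torch

def saturating_sigmoid(x):
    # eq. 3: a rescaled sigmoid that actually reaches 0 and 1.
    return torch.clamp(1.2 * torch.sigmoid(x) - 0.1, min=0.0, max=1.0)

def semhash_bottleneck(enc_y, training=True):
    """enc_y: (batch, d) encoder output."""
    # eq. 4: add Gaussian noise to the encoder output during training.
    noise = torch.randn_like(enc_y) if training else torch.zeros_like(enc_y)
    v = saturating_sigmoid(enc_y + noise)
    # eq. 5: binary discretization with a straight-through gradient.
    v_bin = v + ((enc_y > 0).float() - v).detach()
    if training:
        # Pass the continuous version for a random half of the batch so that
        # gradients keep flowing through the saturating sigmoid.
        mask = (torch.rand(enc_y.size(0), 1) < 0.5).float()
        return mask * v + (1.0 - mask) * v_bin
    return v_bin
```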
Vector Quantization
the embedding is found by nearest-neighbour lookup in the codebook (eq. 6)
the VAE is learnt via reconstruction (eq. 7) and the embeddings are learnt via dictionary learning (VQ loss, eq. 8)
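A minimal sketch of the VQ bottleneck in PyTorch; the commitment weight `beta` and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vq_bottleneck(enc_y, codebook, beta=0.25):
    """enc_y: (batch, d), codebook: (K, d). Returns quantized codes, indices, VQ loss."""
    # eq. 6: nearest-neighbour lookup in the codebook.
    distances = torch.cdist(enc_y, codebook)          # (batch, K)
    indices = distances.argmin(dim=-1)                # discrete latent codes
    quantized = codebook[indices]                     # (batch, d)
    # eq. 8: codebook (dictionary) loss plus commitment loss.
    codebook_loss = F.mse_loss(quantized, enc_y.detach())
    commitment_loss = beta * F.mse_loss(enc_y, quantized.detach())
    # Straight-through estimator so the encoder receives gradients
    # from the reconstruction loss (eq. 7).
    quantized = enc_y + (quantized - enc_y).detach()
    return quantized, indices, codebook_loss + commitment_loss
```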
Decomposed Vector Quantization
Motivation
VQ-VAE suffers from index collapse, where only a few indices are preferred by the model throughout training, as seen in Figure 4
Sliced Vector Quantization
break up the encoder output enc(y) into n_d smaller slices that are quantized separately, similar to multi-head attention in the Transformer (eq. 9)
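A hedged sketch of sliced DVQ that reuses the `vq_bottleneck` sketch above: split enc(y) into n_d slices, quantize each slice against its own codebook, and concatenate the results. The separate-codebook-per-slice choice is an assumption for illustration.

```python
import torch

def sliced_dvq(enc_y, codebooks):
    """enc_y: (batch, d); codebooks: list of n_d tensors, each (K, d // n_d)."""
    n_d = len(codebooks)
    slices = enc_y.chunk(n_d, dim=-1)                 # n_d slices of size d / n_d
    quantized, indices, loss = [], [], 0.0
    for s, cb in zip(slices, codebooks):
        q, idx, l = vq_bottleneck(s, cb)
        quantized.append(q)
        indices.append(idx)
        loss = loss + l
    # The composed discrete code is the tuple of n_d slice indices.
    return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1), loss
```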
Latent Transformer
Main Steps
the VAE encoder encodes the target sentence y into a shorter sequence of discrete latent variables l (parallel)
a latent prediction model, a Transformer, is trained to predict l from the source sentence x (autoregressive)
the VAE decoder decodes the predicted l back into the sequence y (parallel)
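A schematic sketch of the decoding pipeline at inference time; `latent_predictor`, `ae_decoder`, and `greedy_decode` are placeholder names, not the paper's actual API. The VAE encoder is only needed during training.

```python
import torch

@torch.no_grad()
def fast_decode(x, latent_predictor, ae_decoder):
    # 1) Autoregressively predict the short latent sequence l from the source x.
    #    l is much shorter than y, so this loop is cheap.
    l = latent_predictor.greedy_decode(x)
    # 2) Decode all target tokens from l (and x) in parallel.
    y = ae_decoder(l, x)
    return y
```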
Loss Function
reconstruction loss L_r from VAE
latent prediction loss L_lp from Latent Transformer
for the first 10k steps, the true targets y are given to the transformer decoder instead of the decompressed latents l, which ensures the self-attention part has reasonable gradients to train the whole architecture
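A hedged sketch of the combined objective; `autoencoder` and `latent_predictor` are placeholder modules, the `use_true_targets` flag stands in for the warm-up trick above, and the unweighted sum of the two losses is an assumption.

```python
def training_losses(step, x, y, autoencoder, latent_predictor):
    # Discrete latents l for the target y, plus the reconstruction loss L_r.
    latents, L_r = autoencoder(y, x, use_true_targets=(step < 10_000))
    # L_lp: the Transformer learns to predict l autoregressively from x.
    L_lp = latent_predictor(x, targets=latents.detach())
    return L_r + L_lp
```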
Architectures of VAE
Encoder
conv residual blocks + attention + conv to scale down the dimension
Decoder
conv residual blocks + attention + up-conv to scale up the dimension
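A minimal sketch of the convolutional down-sampling used on the encoder side; block counts, kernel sizes, and the overall compression factor are assumptions, and the attention sub-layers are omitted. The decoder would mirror this with transposed convolutions.

```python
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual 1-D conv block followed by a strided conv that halves the length."""
    def __init__(self, d_model):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        self.downsample = nn.Conv1d(d_model, d_model, kernel_size=4, stride=2, padding=1)

    def forward(self, h):                 # h: (batch, d_model, length)
        h = h + self.residual(h)
        return self.downsample(h)         # length halved

# Stacking three such blocks compresses the target sequence by a factor of 8
# before the discrete bottleneck (the factor is an assumed example).
encoder = nn.Sequential(*[DownsampleBlock(512) for _ in range(3)])
```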
Transformer Decoder
Experiments
WMT'16 En-De with a 33k BPE vocabulary
BLEU is higher than the NAT baseline, but lower than the autoregressive (AT) baseline
Latent Transformer with all variants of discrete latent variables
VQ-VAE fails to learn
s-DVQ, p-DVQ, and SemHash train well
Improved decoding speed
Discussions
index collapse is resolved by DVQ
Percentage of latents used by DVQ with varying n_d
why is at most 74.5% of the codebook used?
why does the percentage decrease with higher n_d?
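A small sketch (not from the paper) of how the "percentage of latents used" could be measured: the fraction of distinct code indices emitted over an evaluation set, relative to the codebook size.

```python
import torch

def codebook_usage(indices, codebook_size):
    """indices: LongTensor of code ids collected over a dataset."""
    used = torch.unique(indices).numel()
    return 100.0 * used / codebook_size
```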
Personal Thoughts
Great engineering effort: improving VQ-VAE and proving the performance of a generic framework
wanted to see actual sentence examples and their relationship to the latent space
will latent-variable, non-autoregressive models ever achieve equal or better performance than their autoregressive counterparts?
Link: https://arxiv.org/pdf/1803.03382.pdf
Authors: Kaiser et al., 2018