Abstract
propose training a discrete autoencoder with improved semantic hashing
propose a quantitative efficiency measure to evaluate autoencoders for sequence models
analyze the latent codes produced by a language model that uses the discrete autoencoder
present an application of the autoencoder-augmented decoder in neural translation
Details
Introduction
History of AutoEncoder
continuous autoencoder with bottleneck by Hinton & Salakhutdinov 2006
denoising autoencoder by Vincent et al 2010
variational autoencoder by Kingma & Welling 2013
all of the above are continuous
A discrete latent representation is potentially a better fit
language is inherently discrete
reasoning, planning, and reinforcement learning require discrete representations
Learning discrete latent representations in deep learning is a challenge!
gradient signals vanish at discrete variables during backpropagation, so the improved semantic hashing technique is used to pass gradients through
Evaluating discrete latent representations is difficult: compare the perplexity of a language model with and without the latent component
Applications of discrete latent representations include better interpretability, control over the latent space, and possibly faster and more diverse decoding
Semantic Hashing
Discretization by Improved Semantic Hashing
To discretize a b-dimensional vector v, add Gaussian noise n with mean 0 and std 1, v^n = v + n, and compute two vectors v1 = sigma'(v^n) and v2 = (v^n < 0), where sigma'() is the saturating sigmoid sigma'(x) = max(0, min(1, 1.2 * sigmoid(x) - 0.1))
v2 is the discretized value of v, used for evaluation and inference
during training, 0.5 * v1 + 0.5 * v2 is used for the forward pass, and only v1 gets the gradient (see the sketch after this list)
To discretize a dense representation w, a 1-layer FCN is used: discrete(w)
To convert the discrete code back into a dense representation, a 3-layer FCN is used: bottleneck(w)
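A minimal PyTorch sketch of the discretization and bottleneck described above. The saturating sigmoid is the max(0, min(1, 1.2 * sigmoid(x) - 0.1)) function from Kaiser & Sutskever; the detach-based gradient routing and the layer widths are my reading of the notes, not the paper's code:

```python
import torch
import torch.nn as nn

def saturating_sigmoid(x):
    # sigma'(x) = max(0, min(1, 1.2 * sigmoid(x) - 0.1))
    return torch.clamp(1.2 * torch.sigmoid(x) - 0.1, min=0.0, max=1.0)

def improved_semantic_hashing(v, training=True, noise_std=1.0):
    if not training:
        # Evaluation / inference: hard bits v2 only (no noise -- an assumption).
        return (v < 0).float()
    vn = v + noise_std * torch.randn_like(v)  # v^n = v + n, n ~ N(0, std)
    v1 = saturating_sigmoid(vn)               # smooth, differentiable path
    v2 = (vn < 0).float()                     # hard, non-differentiable path
    # Forward pass is 0.5*v1 + 0.5*v2; detaching the v2 term means the
    # gradient flows only through v1.
    return 0.5 * v1 + (0.5 * v2).detach()

class LatentBottleneck(nn.Module):
    """discrete(w): 1-layer FCN down to bits; bottleneck(w): 3-layer FCN
    back to a dense vector. Layer widths are guesses, not the paper's."""
    def __init__(self, hidden_size, bits):
        super().__init__()
        self.discrete = nn.Linear(hidden_size, bits)
        self.bottleneck = nn.Sequential(
            nn.Linear(bits, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, w):
        code = improved_semantic_hashing(self.discrete(w), self.training)
        return self.bottleneck(code)
```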
AutoEncoder
k convolutional blocks reduce the original sequence length by a factor of 2^k (see the sketch below)
the first 10k training steps are run with the ground-truth continuous representation (no discretization) as a form of pre-training
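A sketch of the length-halving encoder, assuming each block is a strided 1-D convolution (kernel size and activation are assumptions, not the paper's exact hyperparameters):

```python
import torch.nn as nn

def downsampling_encoder(hidden_size, k):
    """k convolutional blocks, each halving the sequence length, so an
    input of length n comes out with length roughly n / 2**k."""
    blocks = []
    for _ in range(k):
        blocks += [
            nn.Conv1d(hidden_size, hidden_size,
                      kernel_size=3, stride=2, padding=1),  # stride 2 halves length
            nn.ReLU(),
        ]
    return nn.Sequential(*blocks)
```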
Experiments
Language Model
validate the autoencoder by comparing an LM with and without the latent representation prepended to the sequence
evaluation is done via discrete sequence autoencoding efficiency (DSAE; see the sketch after this list)
results show that the latent representation learned via semantic hashing achieves good DSAE compared to Gumbel-Softmax
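The notes do not reproduce the paper's exact DSAE formula, so the following is a rough reconstruction of the idea, assuming DSAE measures log-perplexity savings per latent bit; treat the exact normalization as an assumption:

```python
import math

def dsae(log_ppl_base, log_ppl_latent, seq_len, num_symbols, bits_per_symbol):
    """Discrete sequence autoencoding efficiency -- my reconstruction, not
    the paper's verbatim definition: nats of log-perplexity the latent code
    saves over the whole sequence, divided by the code's capacity in nats.
    An efficiency of 1.0 would mean every latent bit saves log(2) nats."""
    nats_saved = (log_ppl_base - log_ppl_latent) * seq_len
    capacity = num_symbols * bits_per_symbol * math.log(2)
    return nats_saved / capacity
```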
Ablation Study
Noise level in saturating sigmoid
the noise std does not seem to have a huge impact. Why?
Deciphering Latent Code
l1, l2, l3, l4:
l1, l2, l2, l4:
l1 stands for "All", and repeating l2 generates a repetition of "EUR 50.00"
l5, l2, l2, l4:
latent symbols seem to depend on the symbols before them
Mixed Decoding
beam search has low diversity across its beams: it does not provide candidates that are semantically equivalent but different n-gram-wise (the same holds for sampling)
propose mixed decoding, where the latent representation is predicted first and the target sentence is then decoded from it; diversity can be obtained by varying the latent code (see the sketch below)
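A schematic of mixed decoding in Python; latent_model, decoder, and their methods are placeholders, not the paper's API:

```python
def mixed_decode(source, latent_model, decoder, num_candidates=5):
    """Two-stage decoding: sample a discrete latent code per candidate,
    then decode the target deterministically from (source, code).
    Diversity comes from the sampled codes, not from the word-level search."""
    candidates = []
    for _ in range(num_candidates):
        code = latent_model.sample(source)               # stage 1: latent code
        candidates.append(decoder.decode(source, code))  # stage 2: target words
    return candidates
```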
Personal Thoughts
Wow, very interesting to see a discrete latent representation capture some of the meaning of language
would like to learn from the active research momentum the Google Brain team has:
define the problem, be aware of both old and new techniques, engineer them to make things work, analyze in depth to understand what is working and why
how is language inherently discrete? Don't we (human brains) think in a continuous way and store our latent variables in a continuous form? The output may be discrete, but do we really need to make the latent space discrete? Are we making it discrete to resolve the posterior collapse seen in continuous models?
Link: https://arxiv.org/pdf/1801.09797.pdf
Authors: Kaiser et al. 2018