Abstract
propose training a discrete autoencoder with improved semantic hashing
propose a quantitative efficiency measure to evaluate autoencoders for sequence models
analyze the latent codes produced by a language model that uses the discrete autoencoder
present an application of the autoencoder-augmented decoder in neural translation
Details
Introduction
History of AutoEncoder
continuous autoencoder with bottleneck by Hinton & Salakhutdinov 2006
denoising autoencoder by Vincent et al 2010
variational autoencoder by Kingma & Welling 2013
all of the above are continuous
A discrete latent representation is potentially a better fit
language is inherently discrete
reasoning, planning, and reinforcement learning require discrete representations
Learning discrete latent representations in deep learning is a challenge!
gradient signals vanish at discrete variables during backpropagation, so the improved semantic hashing technique is used to pass gradients through
Evaluating discrete latent representations is difficult: compare the perplexity of a language model with and without the latent component
Applications of discrete latent representations include better interpretability, control over the latent space, and possibly faster and more diverse decoding
Semantic Hashing
Discretization by Improved Semantic Hashing
To discretize a b-dimensional vector v, add Gaussian noise n with mean 0 and std 1, v^n = v + n, and compute two vectors v1 = sigma'(v^n) and v2 = (v^n < 0), where sigma'() is the saturating sigmoid sigma'(x) = max(0, min(1, 1.2 * sigmoid(x) - 0.1))
v2 is the discretized value of v, used for evaluation and inference
during training, 0.5 * v1 + 0.5 * v2 is used for the forward pass, and only v1 gets the gradient (see the sketch after this list)
To discretize a dense representation w, a 1-layer FCN is used: discrete(w)
To convert the discrete code back into a dense representation, a 3-layer FCN is used: bottleneck(w)
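A minimal PyTorch sketch of the discretization and bottleneck described above. The saturating sigmoid is the max(0, min(1, 1.2 * sigmoid(x) - 0.1)) function from Kaiser & Sutskever; the detach-based gradient routing and the layer widths are my reading of the notes, not the paper's code:

```python
import torch
import torch.nn as nn

def saturating_sigmoid(x):
    # sigma'(x) = max(0, min(1, 1.2 * sigmoid(x) - 0.1))
    return torch.clamp(1.2 * torch.sigmoid(x) - 0.1, min=0.0, max=1.0)

def improved_semantic_hashing(v, training=True, noise_std=1.0):
    if not training:
        # Evaluation / inference: hard bits v2 only (no noise -- an assumption).
        return (v < 0).float()
    vn = v + noise_std * torch.randn_like(v)  # v^n = v + n, n ~ N(0, std)
    v1 = saturating_sigmoid(vn)               # smooth, differentiable path
    v2 = (vn < 0).float()                     # hard, non-differentiable path
    # Forward pass is 0.5*v1 + 0.5*v2; detaching the v2 term means the
    # gradient flows only through v1.
    return 0.5 * v1 + (0.5 * v2).detach()

class LatentBottleneck(nn.Module):
    """discrete(w): 1-layer FCN down to bits; bottleneck(w): 3-layer FCN
    back to a dense vector. Layer widths are guesses, not the paper's."""
    def __init__(self, hidden_size, bits):
        super().__init__()
        self.discrete = nn.Linear(hidden_size, bits)
        self.bottleneck = nn.Sequential(
            nn.Linear(bits, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, w):
        code = improved_semantic_hashing(self.discrete(w), self.training)
        return self.bottleneck(code)
```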
AutoEncoder
k convolutional blocks reduce the original sequence length by a factor of 2^k (see the sketch below)
the first 10k training steps are run with the ground-truth continuous representation (no discretization) as a form of pre-training
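A sketch of the length-halving encoder, assuming each block is a strided 1-D convolution (kernel size and activation are assumptions, not the paper's exact hyperparameters):

```python
import torch.nn as nn

def downsampling_encoder(hidden_size, k):
    """k convolutional blocks, each halving the sequence length, so an
    input of length n comes out with length roughly n / 2**k."""
    blocks = []
    for _ in range(k):
        blocks += [
            nn.Conv1d(hidden_size, hidden_size,
                      kernel_size=3, stride=2, padding=1),  # stride 2 halves length
            nn.ReLU(),
        ]
    return nn.Sequential(*blocks)
```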
Experiments
Language Model
validate the autoencoder by comparing an LM with and without the latent representation prepended to the sequence
evaluation is done via discrete sequence autoencoding efficiency (DSAE; see the sketch after this list)
results show that the latent representation learned via semantic hashing achieves good DSAE compared to Gumbel-Softmax
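The notes do not reproduce the paper's exact DSAE formula, so the following is a rough reconstruction of the idea, assuming DSAE measures log-perplexity savings per latent bit; treat the exact normalization as an assumption:

```python
import math

def dsae(log_ppl_base, log_ppl_latent, seq_len, num_symbols, bits_per_symbol):
    """Discrete sequence autoencoding efficiency -- my reconstruction, not
    the paper's verbatim definition: nats of log-perplexity the latent code
    saves over the whole sequence, divided by the code's capacity in nats.
    An efficiency of 1.0 would mean every latent bit saves log(2) nats."""
    nats_saved = (log_ppl_base - log_ppl_latent) * seq_len
    capacity = num_symbols * bits_per_symbol * math.log(2)
    return nats_saved / capacity
```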
Ablation Study
Noise level in saturating sigmoid
the noise std does not seem to have a huge impact. Why?
Deciphering Latent Code
l1, l2, l3, l4:
l1, l2, l2, l4:
l1 stands for "All", and repeating l2 generates a repetition of "EUR 50.00"
l5, l2, l2, l4:
latent symbols seem to depend on the symbols before them
Mixed Decoding
beam search has low diversity across its beams: it does not provide candidates that are semantically equivalent but different n-gram-wise (the same holds for sampling)
propose mixed decoding, where the latent representation is predicted first and the target sentence is then decoded from it; diversity can be obtained by varying the latent code (see the sketch below)
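A schematic of mixed decoding in Python; latent_model, decoder, and their methods are placeholders, not the paper's API:

```python
def mixed_decode(source, latent_model, decoder, num_candidates=5):
    """Two-stage decoding: sample a discrete latent code per candidate,
    then decode the target deterministically from (source, code).
    Diversity comes from the sampled codes, not from the word-level search."""
    candidates = []
    for _ in range(num_candidates):
        code = latent_model.sample(source)               # stage 1: latent code
        candidates.append(decoder.decode(source, code))  # stage 2: target words
    return candidates
```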
Personal Thoughts
Wow, very interesting to see a discrete latent representation capture some of the meaning of language
would like to learn from the active research momentum the Google Brain team has:
define the problem, be aware of both old and new techniques, engineer them to make things work, analyze in depth to understand what is working and why
how is language inherently discrete? Don't we (human brains) think in a continuous way and store our latent variables in a continuous form? The output may be discrete, but do we really need to make the latent space discrete? Are we making it discrete to resolve the posterior collapse seen in continuous models?
Link: https://arxiv.org/pdf/1801.09797.pdf
Authors: Kaiser et al. 2018