kweonwooj opened this issue 6 years ago
Pretty cool notes.
I think to train it you just need two kinds of losses:
1) the normal auto-encoder (reconstruction) loss, and 2) a clustering loss, i.e. embed the input as z(x), find the closest code vector e, and minimize |z(x) - e|.
In the paper they use two separate losses for 2): one moves e toward z(x), and the other moves z(x) toward e:
|sg[z(x)] - e|
and
|z(x) - sg[e]|
where you can think of the stop-gradient "sg" as turning its argument into a constant.
It should be straightforward in PyTorch.
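For what it's worth, here is a minimal PyTorch sketch of that idea. The class name, the dimensions, and the use of `.detach()` as the stop-gradient are my own illustrative choices, not code from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Hypothetical minimal VQ layer: nearest-code lookup plus the two losses above."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)   # e ~ R^(K x D)
        self.beta = beta

    def forward(self, z_e):
        # z_e: (batch, code_dim), the encoder output z(x)
        dists = torch.cdist(z_e, self.codebook.weight)       # (batch, num_codes) L2 distances
        idx = dists.argmin(dim=1)                            # index of the nearest code vector
        e = self.codebook(idx)                               # (batch, code_dim) chosen codes

        codebook_loss = F.mse_loss(e, z_e.detach())          # |sg[z(x)] - e|^2 : moves e toward z(x)
        commitment_loss = F.mse_loss(z_e, e.detach())        # |z(x) - sg[e]|^2 : moves z(x) toward e
        vq_loss = codebook_loss + self.beta * commitment_loss

        # straight-through estimator: copy decoder gradients straight to the encoder
        z_q = z_e + (e - z_e).detach()
        return z_q, vq_loss
```

The normal auto-encoder loss (e.g. `F.mse_loss(decoder(z_q), x)`) is computed outside this layer and simply added to `vq_loss`.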
I'm trying to implement this in TensorFlow 2.0, but I still have some doubts about the training phase. How can the latent space (also called the codebook in later papers) be learned if it is supposed to be discrete? They also propose using an exponential moving average, which seems like it would produce a latent space that is no longer discrete. I would like to see some code examples of this work.
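Regarding the EMA variant mentioned above: the codebook entries themselves are ordinary continuous vectors in R^(K x D); only the assignment of each encoder output to its nearest entry is discrete, so the code vectors can still be updated smoothly. A rough sketch of that exponential-moving-average update, in the same hypothetical PyTorch style as above (variable names, shapes, and the decay value are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, ema_count, ema_sum, z_e, idx, decay=0.99, eps=1e-5):
    """Hypothetical EMA update on a plain (K, D) codebook tensor: nudge each code
    vector toward the running mean of the encoder outputs z_e assigned to it."""
    K, D = codebook.shape
    one_hot = F.one_hot(idx, K).type_as(z_e)      # (batch, K) assignment matrix

    # per-batch statistics
    count = one_hot.sum(dim=0)                    # (K,)   how many z_e chose each code
    total = one_hot.t() @ z_e                     # (K, D) sum of z_e assigned to each code

    # exponential moving averages of the statistics
    ema_count.mul_(decay).add_(count, alpha=1 - decay)
    ema_sum.mul_(decay).add_(total, alpha=1 - decay)

    # Laplace smoothing so rarely used codes do not divide by ~zero
    n = ema_count.sum()
    stable_count = (ema_count + eps) / (n + K * eps) * n

    codebook.copy_(ema_sum / stable_count.unsqueeze(1))
```

The decoder still only ever sees one of the K code vectors per latent position; it is the values of those K vectors that drift continuously during training, so the EMA update does not break the discreteness of the latent space.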
Isn't the loss function
Abstract
Details
Introduction
Related Work
VQ-VAE
- e ~ R^(K x D), where K is the size of the discrete latent space and D is the dimensionality of each latent embedding vector e_i
- the posterior q(z | x), where x is an input and z is a latent variable, is defined as one-hot: the code e is chosen via the discretization bottleneck
- an L2 error moves the embedding vector e_i toward the encoder output z_e(x)
- sg stands for stop-gradient: the forward pass is the identity and the backward pass is zero
- beta = 0.25, but values between 0.1 and 2.0 have no big impact
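To make the pieces above concrete, the posterior and the full training objective can be written out as follows (my transcription, using the same notation as the notes):

```latex
q(z = k \mid x) =
\begin{cases}
1 & \text{for } k = \arg\min_j \lVert z_e(x) - e_j \rVert_2 \\
0 & \text{otherwise}
\end{cases}

L = \log p\big(x \mid z_q(x)\big)
  + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2
  + \beta \, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2
```

The first term is the reconstruction loss, the second moves the selected code e toward the encoder output z_e(x), and the third is the commitment term weighted by beta.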
Experiments
- 128 x 128 x 3 image into a 32 x 32 x 1 discrete latent space
- 84 x 84 x 3 image -> 21 x 21 x 1 latent space
- x64 compression compared to the original sound wave
Personal Thoughts
Link : https://arxiv.org/pdf/1711.00937.pdf
Authors : van den Oord et al., 2017