Understanding disentangling in β-VAE

Metadata

Authors: Christopher P. Burgess, Irina Higgins, +4 authors Alexander Lerchner
Organization: DeepMind
Publish Date: 2018.04
Paper: https://arxiv.org/pdf/1804.03599.pdf
3rd-party code: https://github.com/1Konny/Beta-VAE

Useful Tutorials of VAE and β-VAE

Read "From Autoencoder to Beta-VAE" or "What a Disentangled Net We Weave: Representation Learning in VAEs" to understand the intuition behind VAE and β-VAE.
Read "Variational Coin Toss" to understand the intuition behind variational inference (the basis of VAE).
Read the variational inference notes in Stanford CS228 - Probabilistic Graphical Models, or refer to "A Tutorial on Variational Bayesian Inference" for more mathematical details.
See also the original VAE paper and the "Notes on Variational Autoencoders".
This paper is a follow-up to the original β-VAE paper.

Background

β-VAE is a state-of-the-art model for unsupervised visual disentangled representation learning.
β-VAE adds an extra hyperparameter β to the VAE objective, which constricts the effective encoding capacity of the latent bottleneck and encourages the latent representation to be more factorized.
The disentangled representations learned by β-VAE have been shown to be important for learning a hierarchy of abstract visual concepts conducive to imagination (SCAN, Higgins et al.) and for improving the transfer performance of reinforcement learning policies, including simulation-to-reality transfer in robotics (DARLA, Higgins et al.).
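
For reference, the β-weighted objective mentioned above is usually written as follows. The notation is assumed here rather than taken from these notes: q_φ(z|x) is the encoder, p_θ(x|z) the decoder, and p(z) the prior (typically an isotropic unit Gaussian). Setting β = 1 recovers the standard VAE objective.

```latex
\mathcal{L}(\theta, \phi; x, z, \beta) =
  \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
  - \beta \, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)
```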

Motivation

It is currently unknown what causes the factorized representations learnt by β-VAE to be axis-aligned with human intuition about the data generative factors, in contrast to those learnt by the standard VAE.
Furthermore, β-VAE has other limitations, such as worse reconstruction fidelity compared to the standard VAE. This is caused by a trade-off introduced by the modified training objective that punishes reconstruction quality in order to encourage disentanglement within the latent representations.
This paper attempts to shed light on the question of why β-VAE disentangles, and to use the new insights to suggest practical improvements to the β-VAE framework to overcome the reconstruction-disentanglement trade-off.

Understanding disentangling in β-VAE

From the perspective of the information bottleneck principle (Tishby et al., 1999), the β-VAE training objective encourages the latent distribution q(z|x) to efficiently transmit information about the data points x by jointly minimizing the β-weighted KL term and maximizing the data log-likelihood.
A strong pressure for overlapping posteriors encourages β-VAE to find a representation space that preserves the locality of points on the data manifold as much as possible.
Hypothesis: β-VAE finds latent components which make different contributions to the log-likelihood term of the objective function. These latent components tend to correspond to features in the data that are intuitively qualitatively different, and therefore may align with the generative factors in the data.
For example, consider optimizing the β-VAE objective under an almost complete information bottleneck constraint (i.e. β >> 1). The optimal strategy in this scenario is to encode only the information about the data points that yields the most significant improvement in the data log-likelihood (i.e. E_{q(z|x)}[log p(x|z)]).
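
A minimal NumPy sketch of the two terms discussed in this section, assuming a diagonal Gaussian posterior, a unit Gaussian prior and a pixel-wise Bernoulli decoder. The function names, the default β and the toy inputs are illustrative choices, not taken from the paper or the linked repository.

```python
import numpy as np


def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)


def bernoulli_log_likelihood(x, x_prob, eps=1e-7):
    """Pixel-wise log p(x|z) for binary targets x and decoder probabilities x_prob."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    return np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)


def beta_vae_loss(x, x_prob, mu, logvar, beta=4.0):
    """Negative beta-VAE objective (to be minimized), averaged over the batch."""
    recon = bernoulli_log_likelihood(x, x_prob)
    kl = gaussian_kl(mu, logvar)
    return np.mean(-recon + beta * kl)


# Toy batch (assumed values): 2 flattened "images" of 4 pixels, 3 latent dimensions.
x = np.array([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]])
x_prob = np.array([[0.9, 0.1, 0.2, 0.8], [0.3, 0.7, 0.6, 0.4]])
mu = np.array([[0.5, 0.0, -0.5], [0.0, 0.0, 0.0]])
logvar = np.zeros((2, 3))

loss_low_beta = beta_vae_loss(x, x_prob, mu, logvar, beta=1.0)
loss_high_beta = beta_vae_loss(x, x_prob, mu, logvar, beta=10.0)
```

With the reconstruction held fixed, raising β only scales the KL penalty, so any latent dimension whose posterior deviates from the prior becomes more expensive. This is the pressure that forces the model to spend its limited capacity on whatever improves the log-likelihood most.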

Intuition of Improvement (The most important part)

For example, in the dSprites dataset (consisting of white 2D sprites varying in position, rotation, scale and shape, rendered onto a black background) the model might only encode the sprite position under such a constraint. Intuitively, when optimizing a pixel-wise decoder log-likelihood, information about position yields the largest gains compared to information about any of the other factors of variation in the data, since the likelihood will vanish if the reconstructed position is off by just a few pixels.
Continuing this intuitive picture, we can imagine that if the capacity of the information bottleneck were gradually increased, the model would continue to utilize those extra bits for an increasingly precise encoding of position, until some point of diminishing returns is reached for position information, where a larger improvement can be obtained by encoding and reconstructing another factor of variation in the dataset, such as sprite scale.
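
The intuition above can be checked with a toy calculation. The sketch below is an illustration under stated assumptions (a hypothetical 32x32 canvas with a square sprite, not the actual dSprites data). It counts how many pixels disagree when a reconstruction gets the position wrong versus when it gets the scale slightly wrong. Under a pixel-wise likelihood every disagreeing pixel incurs a penalty, so more mismatched pixels means a lower log-likelihood.

```python
import numpy as np


def draw_square(top, left, side, size=32):
    """Render a white square on a black canvas (a stand-in for a dSprites sprite)."""
    img = np.zeros((size, size), dtype=np.float64)
    img[top:top + side, left:left + side] = 1.0
    return img


def pixel_mismatch(a, b):
    """Number of pixels where two binary images disagree."""
    return int(np.sum(a != b))


target = draw_square(top=10, left=10, side=6)

# Reconstruction with the right scale but the position off by 3 pixels on each axis.
wrong_position = draw_square(top=13, left=13, side=6)
# Reconstruction with the right position but a slightly larger scale.
wrong_scale = draw_square(top=10, left=10, side=7)

position_errors = pixel_mismatch(target, wrong_position)
scale_errors = pixel_mismatch(target, wrong_scale)
print(position_errors, scale_errors)
```

In this toy setup the 3-pixel shift flips 54 pixels, while growing the side length by one pixel flips only 13, which is consistent with position being the most valuable factor to encode first.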
The authors further test this intuition by training a model to generate dSprites conditioned on the ground truth factors, with a controllable information bottleneck. Each factor is independently scaled by a learnable parameter and is subject to independently scaled additive noise (also learned), similar to the reparameterized latent distribution in β-VAE. Throughout training, the capacity of the information bottleneck is increased linearly. The experiment shows that the early capacity is allocated to the positional latents only (x and y), followed by a scale latent, then the shape and orientation latents.
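
A rough sketch of the two ingredients described in this experiment: a per-factor noisy channel (a scale plus additive noise for each factor) and a capacity target that grows linearly over training. The function names, the 5-factor layout and all numeric values are assumptions for illustration. The paper's exact parameterization and schedule may differ, and nothing is trained here.

```python
import numpy as np


def noisy_factor_channel(factors, scale, noise_std, rng):
    """Pass ground-truth factors through z_i = scale_i * f_i + noise_std_i * eps_i."""
    eps = rng.standard_normal(factors.shape)
    return scale * factors + noise_std * eps


def linear_capacity(step, total_steps, max_capacity):
    """Capacity target that rises linearly from 0 to max_capacity, then stays flat."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return frac * max_capacity


rng = np.random.default_rng(0)
# Hypothetical batch: 4 samples, 5 normalized factors (x, y, scale, shape, orientation).
factors = rng.uniform(-1.0, 1.0, size=(4, 5))
scale = np.ones(5)            # would be learned in the actual experiment
noise_std = np.full(5, 0.1)   # would be learned in the actual experiment

z = noisy_factor_channel(factors, scale, noise_std, rng)
schedule = [linear_capacity(s, total_steps=100, max_capacity=25.0) for s in (0, 50, 100, 150)]
```

A larger scale relative to the noise lets a factor pass more information through its channel, so tracking the learned scale-to-noise ratio of each factor while the capacity target rises is one way to see which factors the model chooses to encode first.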

Reference

SCAN: Learning Hierarchical Compositional Visual Concepts by Irina Higgins et al. ICLR 2018.
DARLA: Improving Zero-Shot Transfer in Reinforcement Learning by Irina Higgins et al. ICML 2017.
