Chapter 01 : Introduction and Background
Artificial Intelligence
purpose of machine learning (artificial intelligence): what if we apply our intelligence to acquire knowledge about intelligence, and create tools that allow us to overcome our own cognitive limitations?
Probabilistic Models
the process of learning is to approximate the true distribution of the data, p*(x), using observed data x; usually we are more interested in conditional models, p*(y|x)
we choose a prior distribution over the unknown parameters or latent variables, which we update to a posterior distribution after seeing the data. One method for computing such a posterior is variational inference
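for reference, the update being described is just Bayes' rule (the symbols θ for the unknowns and D for the data are notation I'm adding, not from the notes):

```latex
% Bayes' rule: the prior p(\theta) over unknown parameters / latent variables
% is updated to the posterior p(\theta | D) after observing the data D.
\[
  p(\theta \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}
\]
```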
Deep Learning / neural networks come in handy for parameterizing the conditional distribution, because neural networks have high model capacity and a robust learning mechanism (stochastic gradient descent); a minimal sketch follows below
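a minimal sketch of this idea, assuming PyTorch; the layer sizes, class count, and toy data are arbitrary placeholders, not anything prescribed by the notes:

```python
# Sketch: a neural network parameterizing a conditional categorical distribution
# p(y|x), fit by stochastic gradient descent on the (minibatch) log-likelihood.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))  # logits of p(y|x)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

x = torch.randn(128, 10)          # toy observed inputs
y = torch.randint(0, 3, (128,))   # toy observed labels

for step in range(100):
    idx = torch.randint(0, x.shape[0], (32,))           # random minibatch
    log_probs = torch.log_softmax(net(x[idx]), dim=-1)  # log p(y|x) per class
    nll = -log_probs[torch.arange(32), y[idx]].mean()   # negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()
```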
Directed (probabilistic) Graphical Models
a type of probabilistic model in which all the variables are topologically organized into a directed acyclic graph; each variable depends on the values of its parent variables in the graph, much like each "layer" of a neural network depends on its "parent layer"
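the corresponding factorization of the joint distribution (Pa(x_j) denotes the parents of variable x_j in the graph):

```latex
% Joint distribution of a directed graphical model over variables x_1..x_M:
% each factor conditions a variable on its parents Pa(x_j) in the DAG.
\[
  p(x_1, \ldots, x_M) \;=\; \prod_{j=1}^{M} p\!\left(x_j \mid Pa(x_j)\right)
\]
```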
Maximum Likelihood
the most common criterion for probabilistic models is maximum log-likelihood (ML), which is equivalent to minimizing the Kullback-Leibler divergence between the data distribution and the model distribution
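spelled out, with p* the data distribution and p_θ the model (the equivalence holds because the entropy of p* does not depend on θ):

```latex
% Maximizing expected log-likelihood is equivalent to minimizing KL(p* || p_theta),
% since D_KL(p* || p_theta) = -H(p*) - E_{p*}[log p_theta(x)] and H(p*) is constant in theta.
\[
  \arg\max_{\theta} \, \mathbb{E}_{p^*(x)}\!\left[\log p_{\theta}(x)\right]
  \;=\;
  \arg\min_{\theta} \, D_{KL}\!\left(p^*(x) \,\|\, p_{\theta}(x)\right)
\]
```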
since computing log-probabilities over all data points is expensive, we use stochastic gradient descent on randomly drawn minibatches of data, which form an unbiased estimator of the ML criterion (and of its gradient)
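a sketch of that estimator (my notation: N data points, a uniformly drawn minibatch M of size m):

```latex
% A randomly drawn minibatch M of size m gives an unbiased estimate of the
% full-data average log-likelihood (and, by linearity, of its gradient).
\[
  \frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\!\left(x^{(i)}\right)
  \;\approx\;
  \frac{1}{m}\sum_{i \in \mathcal{M}} \log p_{\theta}\!\left(x^{(i)}\right),
  \qquad
  \mathbb{E}_{\mathcal{M}}\!\left[\frac{1}{m}\sum_{i \in \mathcal{M}} \log p_{\theta}\!\left(x^{(i)}\right)\right]
  \;=\;
  \frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\!\left(x^{(i)}\right)
\]
```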
Latent Variable
Latent Variables are variables that are part of the model, but which we don't observe, and are therefore not part of the dataset.
in the case of unconditional modeling of an observed variable x, the directed graphical model then represents a joint distribution p(x, z) over both the observed variables x and the latent variables z
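the usual factorization of such a joint: a prior over the latent variables and a conditional model over the observed variables:

```latex
% Common factorization of the joint distribution of a latent variable model.
\[
  p(x, z) \;=\; p(z)\, p(x \mid z)
\]
```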
Deep Latent Variable Model (DLVM) denotes a latent variable model p(x, z) whose distributions are parameterized by neural networks. Such a model can also be conditioned on some context, e.g. p(x, z | y)
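a minimal code sketch of a DLVM under illustrative assumptions (PyTorch, a standard-Normal prior over z, and a Bernoulli likelihood over x are my choices here, not prescribed by the notes):

```python
# Sketch of a DLVM: p(x, z) = p(z) p(x|z), with p(x|z) parameterized by a
# neural network ("decoder"). The joint is tractable to evaluate pointwise.
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, data_dim))   # outputs Bernoulli logits

def log_joint(x, z):
    """log p(x, z) = log p(z) + log p(x|z)."""
    prior = torch.distributions.Normal(0.0, 1.0)
    log_pz = prior.log_prob(z).sum(-1)
    likelihood = torch.distributions.Bernoulli(logits=decoder(z))
    log_px_given_z = likelihood.log_prob(x).sum(-1)
    return log_pz + log_px_given_z

# Ancestral sampling from the model: z ~ p(z), then x ~ p(x|z).
z = torch.randn(4, latent_dim)
x = torch.distributions.Bernoulli(logits=decoder(z)).sample()
print(log_joint(x, z))
```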
advantage of DLVMs: even when each factor (prior or conditional distribution) in the directed model is relatively simple (e.g., a conditional Gaussian), the marginal distribution p(x) can be very complex ~ high expressivity
the main difficulty in DLVMs is the intractability of the marginal likelihood p(x) and the posterior p(z|x) (the joint p(x, z) itself is tractable to compute)
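concretely, both quantities follow from the factorization above; the integral generally has no analytic solution when the conditionals are neural networks, and the posterior in turn requires the marginal:

```latex
% Marginal likelihood (intractable integral over z) and posterior (needs p(x)).
\[
  p(x) \;=\; \int p(x, z)\, dz ,
  \qquad
  p(z \mid x) \;=\; \frac{p(x, z)}{p(x)}
\]
```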
hence approximate inference techniques for the posterior p(z|x) and the marginal likelihood p(x) are an important research topic
Research Questions
How can we perform efficient approximate posterior inference and efficient approximate maximum likelihood estimation in deep latent variable models, in the presence of large datasets? : Auto-Encoding Variational Bayes
Can we use the proposed VAE framework to improve upon state-of-the-art semi-supervised classification results? : Semi-supervised Learning with Deep Generative Models
Does there exist a practical normalizing flow that scales well to high-dimensional latent space? : Improving Variational Inference with Inverse Autoregressive Flow
Can we improve upon the reparameterization-based gradient estimator by constructing a gradient estimator whose variance grows inversely proportional to the minibatch size, without sacrificing parallelizability? : Variational Dropout and the Local Reparameterization Trick
Can we improve upon existing stochastic gradient-based optimization methods? : Adam: A Method for Stochastic Optimization
Personal Thoughts