Chapter 01 : Introduction and Background
Artificial Intelligence
purpose of machine learning (artificial intelligence): what if we apply our intelligence to acquire knowledge about intelligence, and create tools that allow us to overcome our own cognitive limitations?
Probabilistic Models
the process of learning is to approximate the true distribution of the data, p*(x), using observed data x; usually we are more interested in conditional models, p*(y|x)
we choose a prior distribution over the unknown parameters or latent variables, which we update to a posterior distribution after seeing the data. One method for computing such a posterior is variational inference
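for reference, the update being described is just Bayes' rule (the symbols θ for the unknowns and D for the data are notation I'm adding, not from the notes):

```latex
% Bayes' rule: the prior p(\theta) over unknown parameters / latent variables
% is updated to the posterior p(\theta | D) after observing the data D.
\[
  p(\theta \mid \mathcal{D}) \;=\; \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}
\]
```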
Deep Learning / neural networks come in handy for parameterizing the conditional distribution, because neural networks have high model capacity and a robust learning mechanism (stochastic gradient descent); a minimal sketch follows below
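a minimal sketch of this idea, assuming PyTorch; the layer sizes, class count, and toy data are arbitrary placeholders, not anything prescribed by the notes:

```python
# Sketch: a neural network parameterizing a conditional categorical distribution
# p(y|x), fit by stochastic gradient descent on the (minibatch) log-likelihood.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))  # logits of p(y|x)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

x = torch.randn(128, 10)          # toy observed inputs
y = torch.randint(0, 3, (128,))   # toy observed labels

for step in range(100):
    idx = torch.randint(0, x.shape[0], (32,))           # random minibatch
    log_probs = torch.log_softmax(net(x[idx]), dim=-1)  # log p(y|x) per class
    nll = -log_probs[torch.arange(32), y[idx]].mean()   # negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()
```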
Directed (probabilistic) Graphical Models
a type of probabilistic model in which all the variables are topologically organized into a directed acyclic graph; each variable depends on the values of its parent variables in the graph, much like each "layer" of a neural network depends on its "parent layer"
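the corresponding factorization of the joint distribution (Pa(x_j) denotes the parents of variable x_j in the graph):

```latex
% Joint distribution of a directed graphical model over variables x_1..x_M:
% each factor conditions a variable on its parents Pa(x_j) in the DAG.
\[
  p(x_1, \ldots, x_M) \;=\; \prod_{j=1}^{M} p\!\left(x_j \mid Pa(x_j)\right)
\]
```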
Maximum Likelihood
the most common criterion for probabilistic models is maximum log-likelihood (ML), which is equivalent to minimizing the Kullback-Leibler divergence between the data distribution and the model distribution
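spelled out, with p* the data distribution and p_θ the model (the equivalence holds because the entropy of p* does not depend on θ):

```latex
% Maximizing expected log-likelihood is equivalent to minimizing KL(p* || p_theta),
% since D_KL(p* || p_theta) = -H(p*) - E_{p*}[log p_theta(x)] and H(p*) is constant in theta.
\[
  \arg\max_{\theta} \, \mathbb{E}_{p^*(x)}\!\left[\log p_{\theta}(x)\right]
  \;=\;
  \arg\min_{\theta} \, D_{KL}\!\left(p^*(x) \,\|\, p_{\theta}(x)\right)
\]
```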
since computing log-probabilities over all data points is expensive, we use stochastic gradient descent on randomly drawn minibatches of data, which form an unbiased estimator of the ML criterion (and of its gradient)
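a sketch of that estimator (my notation: N data points, a uniformly drawn minibatch M of size m):

```latex
% A randomly drawn minibatch M of size m gives an unbiased estimate of the
% full-data average log-likelihood (and, by linearity, of its gradient).
\[
  \frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\!\left(x^{(i)}\right)
  \;\approx\;
  \frac{1}{m}\sum_{i \in \mathcal{M}} \log p_{\theta}\!\left(x^{(i)}\right),
  \qquad
  \mathbb{E}_{\mathcal{M}}\!\left[\frac{1}{m}\sum_{i \in \mathcal{M}} \log p_{\theta}\!\left(x^{(i)}\right)\right]
  \;=\;
  \frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\!\left(x^{(i)}\right)
\]
```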
Latent Variable
Latent Variables are variables that are part of the model, but which we don't observe, and are therefore not part of the dataset.
in the case of unconditional modeling of an observed variable x, the directed graphical model then represents a joint distribution p(x, z) over both the observed variables x and the latent variables z
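the usual factorization of such a joint: a prior over the latent variables and a conditional model over the observed variables:

```latex
% Common factorization of the joint distribution of a latent variable model.
\[
  p(x, z) \;=\; p(z)\, p(x \mid z)
\]
```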
Deep Latent Variable Model (DLVM) denotes a latent variable model p(x, z) whose distributions are parameterized by neural networks. Such a model can also be conditioned on some context, e.g. p(x, z | y)
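a minimal code sketch of a DLVM under illustrative assumptions (PyTorch, a standard-Normal prior over z, and a Bernoulli likelihood over x are my choices here, not prescribed by the notes):

```python
# Sketch of a DLVM: p(x, z) = p(z) p(x|z), with p(x|z) parameterized by a
# neural network ("decoder"). The joint is tractable to evaluate pointwise.
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, data_dim))   # outputs Bernoulli logits

def log_joint(x, z):
    """log p(x, z) = log p(z) + log p(x|z)."""
    prior = torch.distributions.Normal(0.0, 1.0)
    log_pz = prior.log_prob(z).sum(-1)
    likelihood = torch.distributions.Bernoulli(logits=decoder(z))
    log_px_given_z = likelihood.log_prob(x).sum(-1)
    return log_pz + log_px_given_z

# Ancestral sampling from the model: z ~ p(z), then x ~ p(x|z).
z = torch.randn(4, latent_dim)
x = torch.distributions.Bernoulli(logits=decoder(z)).sample()
print(log_joint(x, z))
```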
advantage of DLVMs: even when each factor (prior or conditional distribution) in the directed model is relatively simple (e.g., a conditional Gaussian), the marginal distribution p(x) can be very complex ~ high expressivity
the main difficulty in DLVMs is the intractability of the marginal likelihood p(x) and the posterior p(z|x) (the joint p(x, z) itself is tractable to compute)
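concretely, both quantities follow from the factorization above; the integral generally has no analytic solution when the conditionals are neural networks, and the posterior in turn requires the marginal:

```latex
% Marginal likelihood (intractable integral over z) and posterior (needs p(x)).
\[
  p(x) \;=\; \int p(x, z)\, dz ,
  \qquad
  p(z \mid x) \;=\; \frac{p(x, z)}{p(x)}
\]
```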
hence approximate inference techniques for the posterior p(z|x) and the marginal likelihood p(x) are an important research topic
Research Questions
How can we perform efficient approximate posterior inference and efficient approximate maximum likelihood estimation in deep latent variable models, in the presence of large datasets? : Auto-Encoding Variational Bayes
Can we use the proposed VAE framework to improve upon state-of-the-art semi-supervised classification results? : Semi-supervised Learning with Deep Generative Models
Does there exist a practical normalizing flow that scales well to high-dimensional latent space? : Improving Variational Inference with Inverse Autoregressive Flow
Can we improve upon the reparameterization-based gradient estimator by constructing a gradient estimator whose variance grows inversely proportional to the minibatch size, without sacrificing parallelizability? : Variational Dropout and the Local Reparameterization Trick
Can we improve upon existing stochastic gradient-based optimization methods? : Adam: A Method for Stochastic Optimization
Personal Thoughts