Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

https://arxiv.org/pdf/1612.00005v2.pdf

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

Summary:

Dataset: ImageNet
Objective: Find a generative model that avoids the usual shortcomings of such models, i.e. one that produces images that (i) are high-resolution, (ii) are varied and (iii) match the diversity of the dataset.

Inner-workings:

The idea is to find an image that maximizes the probability for a given label by using a variant of a Markov Chain Monte Carlo (MCMC) sampler.

Here the first term (the image prior p(x)) ensures that we stay on the image manifold we are trying to model and don't just produce adversarial examples, while the second term (the classifier probability p(y|x)) makes sure that we find an image corresponding to the label we're looking for.
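
As a reference for the two terms discussed above, the factorization they come from can be written as follows. This is reconstructed from the paper's formulation rather than from this summary, and y_c denotes the target class:

```latex
p(x \mid y = y_c) \;\propto\; p(x)\, p(y = y_c \mid x)
```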

Basically we start with a random image and iteratively find a better image to match the label we're trying to generate.

MALA-approx:

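MALA-approx is the MCMC sampler, based on the Metropolis-Adjusted Langevin Algorithm, that they use in the paper. It is defined iteratively as follows:

x_{t+1} = x_t + epsilon1 * d log p(x_t) / d x_t + epsilon2 * d log p(y = y_c | x_t) / d x_t + N(0, epsilon3^2)

where: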

epsilon1 makes the image more generic.
epsilon2 increases confidence in the chosen class.
epsilon3 adds noise to encourage diversity.
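
To make the roles of the three step sizes concrete, here is a minimal NumPy sketch of one such update on a toy problem. Everything about the toy setup is an assumption made for illustration and is not from the paper: x is a 2-D vector instead of an image, the prior is a standard Gaussian, and the "condition network" is a hypothetical 3-class linear softmax classifier with weights `W`.

```python
import numpy as np

def mala_approx_step(x, grad_log_prior, grad_log_cond, eps1, eps2, eps3, rng):
    """One update: prior step + condition step + Gaussian noise with std eps3."""
    noise = eps3 * rng.standard_normal(x.shape)
    return x + eps1 * grad_log_prior(x) + eps2 * grad_log_cond(x) + noise

# Hypothetical toy "condition network": linear softmax over 3 classes, 2-D input.
W = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])

def class_probs(x):
    logits = W @ x
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_prior(x):
    # Toy prior (assumption): standard Gaussian, so d log p(x) / dx = -x.
    return -x

def make_grad_log_cond(c):
    # Gradient of log softmax_c(W x) with respect to x.
    def grad(x):
        return W[c] - class_probs(x) @ W
    return grad

def sample(c, steps=200, eps1=0.01, eps2=0.1, eps3=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=2)  # start from a random point
    grad_log_cond = make_grad_log_cond(c)
    for _ in range(steps):
        x = mala_approx_step(x, grad_log_prior, grad_log_cond,
                             eps1, eps2, eps3, rng)
    return x

if __name__ == "__main__":
    x = sample(c=0)
    print("final x:", x, "class probabilities:", class_probs(x))
```

Setting `eps3=0` turns the chain into plain gradient ascent, which is a convenient way to check the two gradient terms in isolation before adding noise back in.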

Image prior:

They try several priors for the images:

PPGN-x: p(x) is modeled with a Denoising Auto-Encoder (DAE).
DGN-AM: models x through a latent code h, with the generator G trained as a GAN.
PPGN-h: incorporates a prior p(h) on the latent code, modeled with a DAE.
Joint PPGN-h: to increase the expressivity of G, the DAE models h by first modeling x.
Noiseless joint PPGN-h: same as previous but without noise.
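
The reason a DAE can serve as a prior in the variants above is that its reconstruction gives an estimate of the gradient of the log density, which is the quantity the sampler needs. A sketch of that relation as used in the paper, where R_x is the DAE reconstruction function and sigma^2 the variance of the noise used to train it (the notation is an assumption here, and the 1/sigma^2 factor is folded into epsilon1 in the second line):

```latex
\frac{\partial \log p(x)}{\partial x} \approx \frac{R_x(x) - x}{\sigma^2}

x_{t+1} = x_t + \epsilon_1 \bigl(R_x(x_t) - x_t\bigr)
        + \epsilon_2 \frac{\partial \log p(y = y_c \mid x_t)}{\partial x_t}
        + \mathcal{N}(0, \epsilon_3^2)
```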

Conditioning:

In the paper they mostly condition on class labels, but captions, or pretty much any other signal a condition network can score, can also be used.

Architecture:

The final architecture using a pretrained classifier network is below. Note that only G and D are trained.

Results:

Pretty much any base network can be used, with minimal training of G and D. The model produces very realistic images with great diversity; see below for examples of 227x227 images generated for ImageNet classes.
