In this note, I'll keep recording findings that I think are important or useful, focusing on the theoretical and heuristic parts of several GAN papers. This thread will be actively updated whenever I read a GAN paper! :blush:
Notations:
p_{data}: Probability density/mass function of the real data.
p_{g} / p_{d}: Probability density/mass function of the generator/discriminator.
For G fixed, the optimal discriminator is D^{*}_{G}(x) = p_{data}(x) / (p_{data}(x) + p_{g}(x)).
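A quick sketch of where this comes from (following the original paper's pointwise argument; f below is just my shorthand for the integrand of the value function at a fixed x):

```latex
% For fixed G, D maximizes pointwise in x:
%   f(D) = p_{data}(x)\,\log D(x) + p_{g}(x)\,\log\bigl(1 - D(x)\bigr)
% Setting df/dD = 0:
\frac{p_{data}(x)}{D(x)} - \frac{p_{g}(x)}{1 - D(x)} = 0
\quad\Longrightarrow\quad
D^{*}_{G}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}
```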
Global optimality: the GAN training criterion has a global optimum at p_{g} = p_{data} (i.e., the generator perfectly replicates the real data distribution).
Essentially, when the discriminator is optimal, the GAN loss quantifies the similarity between p_{g} and p_{data} via the Jensen-Shannon (JS) divergence, which is symmetric.
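Concretely, substituting the optimal D^{*}_{G} back into the value function gives the generator's criterion in terms of the JS divergence (this is the derivation in the original paper):

```latex
C(G) = \mathbb{E}_{x\sim p_{data}}\!\left[\log\frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}\right]
     + \mathbb{E}_{x\sim p_{g}}\!\left[\log\frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)}\right]
     = -\log 4 + 2\,\mathrm{JSD}\bigl(p_{data}\,\|\,p_{g}\bigr)
```

So C(G) is minimized exactly when p_{g} = p_{data}, which is also where the global-optimality claim above comes from.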
Convergence: if G and D have enough capacity, and at each step of training the discriminator is allowed to reach its optimum given G, and p_{g} is updated so as to improve the criterion, then p_{g} converges to p_{data}.
G must not be trained too much without updating D, in order to avoid mode collapse in G.
Note: the following discussion is within the scope of vanilla GANs.
Training GANs requires finding the Nash equilibrium of a game, which is a more difficult problem than optimizing an objective function.
Simply flipping the sign of the discriminator's objective for the generator (i.e., having the generator maximize the discriminator's cross-entropy loss) can make the generator's gradient vanish when the discriminator rejects generator samples with high confidence.
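A tiny numerical illustration of that saturation (a sketch in PyTorch; `d_logit_fake` is a made-up logit standing in for the discriminator's output on one generated sample):

```python
import torch
import torch.nn.functional as F

# Hypothetical discriminator logit on a generated sample; a large negative
# value means D confidently rejects the sample.
d_logit_fake = torch.tensor([-8.0], requires_grad=True)

# Minimax ("flipped sign") generator loss: log(1 - D(G(z))).
loss_minimax = F.logsigmoid(-d_logit_fake)  # = log(1 - sigmoid(logit))
loss_minimax.backward()
print(d_logit_fake.grad)  # ~ -3e-4: the gradient has (almost) vanished

d_logit_fake.grad = None

# Non-saturating heuristic used in practice: -log D(G(z)).
loss_nonsat = -F.logsigmoid(d_logit_fake)
loss_nonsat.backward()
print(d_logit_fake.grad)  # ~ -1.0: still a strong learning signal
```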
MLE (maximum likelihood estimation) is equivalent to minimizing the KL divergence KL(p_{data} || p_{g}).
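One line of reasoning for this equivalence (θ denotes the model/generator parameters, a symbol I'm introducing here): the data entropy term does not depend on θ, so minimizing the forward KL and maximizing the expected log-likelihood pick out the same θ.

```latex
\mathrm{KL}\bigl(p_{data}\,\|\,p_{g}\bigr)
  = \mathbb{E}_{x\sim p_{data}}\bigl[\log p_{data}(x)\bigr]
    - \mathbb{E}_{x\sim p_{data}}\bigl[\log p_{g}(x;\theta)\bigr]
\;\Longrightarrow\;
\arg\min_{\theta}\,\mathrm{KL}\bigl(p_{data}\,\|\,p_{g}\bigr)
  = \arg\max_{\theta}\,\mathbb{E}_{x\sim p_{data}}\bigl[\log p_{g}(x;\theta)\bigr]
```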
VAE (variational autoencoder) vs. GAN: a VAE is trained by (approximate) maximum likelihood, i.e., it maximizes a lower bound on the data likelihood, whereas GANs aim to generate realistic samples rather than maximize likelihood.
GANs minimize the JS divergence, which behaves similarly to minimizing the reverse KL divergence, i.e., KL(p_{g} || p_{data}). (KL divergence is not symmetric.)
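For reference, the two directions penalize different mistakes, which is why the distinction matters below:

```latex
\mathrm{KL}\bigl(p_{data}\,\|\,p_{g}\bigr)
  = \mathbb{E}_{x\sim p_{data}}\!\left[\log\frac{p_{data}(x)}{p_{g}(x)}\right]
  \quad\text{(heavily penalizes } p_{g}\approx 0 \text{ where } p_{data}>0\text{)}
\qquad
\mathrm{KL}\bigl(p_{g}\,\|\,p_{data}\bigr)
  = \mathbb{E}_{x\sim p_{g}}\!\left[\log\frac{p_{g}(x)}{p_{data}(x)}\right]
  \quad\text{(heavily penalizes } p_{g}>0 \text{ where } p_{data}\approx 0\text{)}
```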
GANs do not use MLE, but they can be made to do so by modifying the generator's objective function, under the assumption that the discriminator is optimal. GANs still generate realistic samples even when trained with this MLE objective. (See the paper "On Distinguishability Criteria for Estimating Generative Models" by Goodfellow, ICLR 2015; also see the video at 55:00.) Thus, the choice of divergence (KL vs. reverse KL) cannot explain why GANs generate realistic samples.
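As far as I remember from the tutorial, the modification touches only the generator cost (σ is the logistic sigmoid; the two forms below are algebraically equal, and the claim that minimizing this recovers maximum likelihood relies on D being optimal):

```latex
J^{(G)} = -\tfrac{1}{2}\,\mathbb{E}_{z}\Bigl[\exp\bigl(\sigma^{-1}(D(G(z)))\bigr)\Bigr]
        = -\tfrac{1}{2}\,\mathbb{E}_{z}\!\left[\frac{D(G(z))}{1 - D(G(z))}\right]
```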
Maybe it is the approximation strategy of using supervised learning to estimate the density ratio that makes the generated samples so realistic. (See the video at 59:15.)
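The density-ratio view in one line: if the discriminator is (near) optimal, the ratio can be read off directly from its output, so training D by ordinary binary classification is an implicit density-ratio estimator that never writes down p_{data} or p_{g}.

```latex
D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}
\quad\Longrightarrow\quad
\frac{p_{data}(x)}{p_{g}(x)} = \frac{D^{*}(x)}{1 - D^{*}(x)}
```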
GANs often choose to generate from very few modes, fewer than the limit imposed by the model's capacity. The reverse KL prefers to generate from as many modes of the data distribution as the model is able to; it does not prefer fewer modes in general. This suggests that mode collapse is driven by a factor other than the choice of divergence.
Comparison to MLE and NCE: See #25.
Training tricks:
Virtual batch norm > batch norm (avoids generating highly correlated samples within a batch); see the sketch below.
Mode collapse is believed not to be caused by minimizing the reverse KL, since mode collapse still happens when minimizing the forward KL. A deficiency in the design of the minimax game could be a cause of mode collapse. See the paper "Unrolled Generative Adversarial Networks", which successfully generates different modes of the data; a simplified sketch of the unrolling idea is also given below.
Model architectures that cannot capture global structure will produce generated images with incorrect global structure.
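Here is a minimal sketch of virtual batch norm, referenced in the training tricks above. The class name and the simplifications (statistics taken from a fixed reference batch only, without mixing in the current example) are my own; see Salimans et al., "Improved Techniques for Training GANs", for the full version.

```python
import torch
import torch.nn as nn

class VirtualBatchNorm1d(nn.Module):
    """Sketch of virtual batch normalization: normalize with statistics from a
    fixed reference batch chosen once before training, so a sample's output no
    longer depends on the other samples in its minibatch."""

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("ref_mean", torch.zeros(num_features))
        self.register_buffer("ref_var", torch.ones(num_features))

    @torch.no_grad()
    def set_reference(self, ref_batch):
        # Call once with the fixed reference batch (shape: [N, num_features]).
        self.ref_mean.copy_(ref_batch.mean(dim=0))
        self.ref_var.copy_(ref_batch.var(dim=0, unbiased=False))

    def forward(self, x):
        x_hat = (x - self.ref_mean) / torch.sqrt(self.ref_var + self.eps)
        return self.gamma * x_hat + self.beta
```

Usage: create the layer, call `set_reference(ref_batch)` once at the start of training, then use it like an ordinary normalization layer inside the generator.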
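And a simplified sketch of the unrolled-GAN idea mentioned above: before each generator update, copy the discriminator, take a few extra discriminator steps on the copy, and update the generator against that look-ahead copy. The toy networks, dimensions, and optimizers are placeholders, and unlike the paper this sketch does not backpropagate through the unrolled updates (a first-order simplification).

```python
import copy
import torch
import torch.nn as nn

# Placeholder toy networks on 2-D data with an 8-D latent code.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)  # regular D updates not shown
bce = nn.BCEWithLogitsLoss()

def d_loss(disc, real, fake):
    # Standard discriminator loss: real -> 1, fake -> 0.
    return (bce(disc(real), torch.ones(real.size(0), 1)) +
            bce(disc(fake), torch.zeros(fake.size(0), 1)))

def generator_step(real, k_unroll=5):
    # 1) Unroll: copy D and take k extra optimization steps on the copy.
    d_look_ahead = copy.deepcopy(D)
    opt_copy = torch.optim.SGD(d_look_ahead.parameters(), lr=1e-2)
    for _ in range(k_unroll):
        fake = G(torch.randn(real.size(0), 8)).detach()
        opt_copy.zero_grad()
        d_loss(d_look_ahead, real, fake).backward()
        opt_copy.step()
    # 2) Update G against the look-ahead discriminator (non-saturating loss),
    #    so G cannot exploit the current D's blind spots as easily.
    fake = G(torch.randn(real.size(0), 8))
    g_loss = bce(d_look_ahead(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```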
Papers covered in this note:
Generative Adversarial Nets (NIPS 2014)
NIPS 2016 Tutorial: Generative Adversarial Networks (Video version)
Generative Adversarial Networks (GANs): What it can generate and What it cannot? (Arxiv 2018)
This paper summarizes many GAN papers that address different challenges. Nice summary!