openai / glow

Code for reproducing results in "Glow: Generative Flow with Invertible 1x1 Convolutions"
https://arxiv.org/abs/1807.03039
MIT License

Error in Likelihood Computations? #81

Open AlexanderMath opened 5 years ago

AlexanderMath commented 5 years ago

It seems there is an error in the NLL computations on MNIST. The article reports the NLL in "bits per pixel". The "per pixel" part is computed by dividing by the shape of x:
https://github.com/openai/glow/blob/eaff2177693a5d84a1cf8ae19e8e0441715b82f8/model.py#L185
But in "data_loaders/get_mnist_cifar.py" the MNIST data is zero padded to size 32x32:
https://github.com/openai/glow/blob/eaff2177693a5d84a1cf8ae19e8e0441715b82f8/data_loaders/get_mnist_cifar.py#L40
This has two consequences:

  1. The NLL is divided by 32^2 and not 28^2. This improves the loss.
  2. The "scaling penalty" is also multiplied by 32^2 and not 28^2 (see below). This worsens the loss. https://github.com/openai/glow/blob/eaff2177693a5d84a1cf8ae19e8e0441715b82f8/model.py#L172 It is not clear to us why this would yield the correct likelihood computations.

On an intuitive level, it seems that the "per pixel" loss should be computed by dividing by the original data size 28x28 rather than the padded data size 32x32x3. Below we argue that, if these computations were correct, we could obtain a loss arbitrarily close to 0 simply by increasing the amount of zero padding.

Suppose we normalize the input to [-127.5, +127.5] and change the Gaussian to N(0, 127.5/2). The scaling factor is then 1, so the log "scaling penalty" becomes 0. Since more zero padding increases the shape and thereby decreases the reported loss, we can add more and more zero padding and drive the loss arbitrarily close to 0, which seems to be a problem.
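
A back-of-the-envelope illustration of this argument (hypothetical numbers; it assumes, as above, that the scaling penalty is 0 and that the padded zero dimensions add nothing to the total NLL):

```python
import numpy as np

model_nats = 1200.0   # hypothetical NLL of the 28x28 content, in nats

# Pad the 28x28 image into ever larger canvases. Under the assumptions above,
# the reported bits-per-pixel is the same total divided by more dimensions,
# so it can be pushed arbitrarily close to 0.
for side in [28, 32, 64, 128, 1024]:
    dims = side * side
    print(f"{side}x{side}: {model_nats / (np.log(2.0) * dims):.4f} bits/dim")
```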

Our sincerest apologies if we have misunderstood something. In either case, we would be happy to see an argument for how one can compute the likelihood of a 28x28 MNIST image under a model that is trained on a 32x32 zero-padded variant of MNIST. The reason we are interested in this is that it would allow a data augmentation trick that interpolates images to a larger resolution by zero padding in Fourier space: with appropriate scaling the Fourier transform is unitary and thus has unit determinant.
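
For reference, a small NumPy sketch of that augmentation idea (our own illustration, not code from this repo): under the orthonormal convention the 2D DFT is unitary, so it preserves norms and contributes zero log-determinant, and upsampling amounts to zero padding the centred spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))   # stand-in for a 28x28 MNIST image

# With norm="ortho" the 2D DFT is unitary: it preserves the L2 norm
# (Parseval) and, as a linear map, has |det| = 1.
X = np.fft.fft2(x, norm="ortho")
print(np.allclose(np.linalg.norm(x), np.linalg.norm(X)))  # True

# "Interpolate" to 32x32 by zero padding the centred spectrum. The padding is
# an isometric embedding into the larger space, so the norm is unchanged;
# only the unitary FFT/IFFT factors would contribute to a log-determinant.
X_pad = np.zeros((32, 32), dtype=complex)
X_pad[2:30, 2:30] = np.fft.fftshift(X)
x_up = np.fft.ifft2(np.fft.ifftshift(X_pad), norm="ortho")
print(x_up.shape, np.allclose(np.linalg.norm(X_pad), np.linalg.norm(x_up)))
# (32, 32) True -- x_up keeps a small imaginary part because the even-sized
# Nyquist rows/columns break exact Hermitian symmetry after padding.
```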

christabella commented 4 years ago

I guess the zero padding is part of the pre-processing used to obtain the input x, and from the paper "M is the dimensionality of x", which would then be 32x32?

While this wouldn't impact training/optimization, since the extra terms are constants, I agree that intuitively the absolute likelihood computation seems "wrong", as you pointed out, since the original x is 28x28.