rosinality / vq-vae-2-pytorch

Implementation of Generating Diverse High-Fidelity Images with VQ-VAE-2 in PyTorch

Fidelity to the VQ-VAE-2 paper #32

Closed · natoucs-datagen closed this issue 4 years ago

natoucs-datagen commented 4 years ago

Hi, I am trying to build a two-stage VQ-VAE-2 + PixelCNN as described in the paper "Generating Diverse High-Fidelity Images with VQ-VAE-2" (https://arxiv.org/pdf/1906.00446.pdf). I have three implementation questions:

  1. The paper mentions: "We allow each level in the hierarchy to separately depend on pixels". My understanding is that the second latent space in the VQ-VAE-2 should be conditioned on a concatenation of the first latent space and a downsampled version of the image. However, in this implementation the second latent space is conditioned only on the first one. Why? (I sketch what I mean after this list.) See: https://github.com/rosinality/vq-vae-2-pytorch/blob/master/vqvae.py#L199

  2. There is no class-conditional implementation of the PixelCNN here. The paper "Conditional Image Generation with PixelCNN Decoders" (https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf) says: "If h is a one-hot encoding that specifies a class, this is equivalent to adding a class-dependent bias at every layer". As I understand it, the condition is a 1D tensor that is injected into the bias of every layer through a convolution. Now, for a two-stage conditional PixelCNN, one needs to condition not only on the class vector but also on the latent code of the previous stage. One possibility I see is to concatenate them and feed a single 3D tensor. How would you insert both of those conditions into the PixelCNN architecture? (See the second sketch after this list.)

  3. The loss and optimization are unchanged with two stages: one simply sums the loss of each stage into a final loss that is optimized. Is that right?
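
For point 1, here is a minimal sketch of the kind of top-level encoder I have in mind (module and variable names are my own, not the ones used in vqvae.py): the top encoder would take both the bottom encoder's feature map and a downsampled copy of the input image.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopEncoderWithImage(nn.Module):
    """Hypothetical top encoder that also sees a downsampled copy of the image."""

    def __init__(self, bottom_channels=128, img_channels=3, hidden=128):
        super().__init__()
        # input channels = bottom feature maps + downsampled RGB image
        self.conv = nn.Sequential(
            nn.Conv2d(bottom_channels + img_channels, hidden, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
        )

    def forward(self, enc_b, img):
        # resize the image to the spatial size of the bottom feature map
        img_small = F.interpolate(
            img, size=enc_b.shape[-2:], mode='bilinear', align_corners=False
        )
        # concatenate along the channel dimension and encode
        return self.conv(torch.cat([enc_b, img_small], dim=1))
```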
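For point 2, this is a rough sketch of how I imagine injecting both conditions into a gated PixelCNN layer, following the conditional-bias idea from the PixelCNN Decoders paper: the class label enters as a spatially constant bias, and the previous-stage latent code enters as a spatial bias, so the two do not have to be merged into a single tensor beforehand. All names here are hypothetical and not taken from this repository.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ConditionedGate(nn.Module):
    """Gated activation with a class-dependent bias and a spatial (latent-code) bias."""

    def __init__(self, channels, n_class, cond_channels):
        super().__init__()
        self.class_proj = nn.Linear(n_class, 2 * channels)          # class -> global bias
        self.cond_proj = nn.Conv2d(cond_channels, 2 * channels, 1)  # latent map -> spatial bias

    def forward(self, h, class_onehot, cond_map):
        # h: output of the preceding (masked) convolution, shape (B, 2*C, H, W)
        # class_onehot: float one-hot class vector, shape (B, n_class)
        # cond_map: latent code from the previous stage, shape (B, cond_channels, h', w')
        cond_map = F.interpolate(cond_map, size=h.shape[-2:], mode='nearest')
        h = (
            h
            + self.class_proj(class_onehot)[:, :, None, None]  # broadcast over H, W
            + self.cond_proj(cond_map)
        )
        a, b = h.chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)
```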

natoucs-datagen commented 4 years ago

I answered my own question on Stack Overflow: https://stackoverflow.com/questions/60884274/implementation-of-vq-vae-2-paper/60974545#60974545