nadavbh12 / VQ-VAE

Minimalist implementation of VQ-VAE in Pytorch
BSD 3-Clause "New" or "Revised" License
494 stars 85 forks

Improve results on cifar - nearest neighbor should be performed to 10 dictionaries rather than 1 #8

Closed by pclucas14 5 years ago

pclucas14 commented 5 years ago

Hi,

I'm trying to improve results on CIFAR. I see you already have some potential improvements in mind. Could you help me understand what you mean by "Improve results on cifar - nearest neighbor should be performed to 10 dictionaries rather than 1"? How would you combine the 10 dictionaries during training and testing?

Thanks! Lucas

nadavbh12 commented 5 years ago

Hi Lucas, this note refers to how the VQ-VAE was actually trained in the paper. I didn't catch that on the first (few) readings, so I confirmed it with the authors.

For ImageNet, the encoder's output is a tensor of size 8x8x64. If you have only one codebook, then for each of the 64 (=8x8) spatial latents (each a 64-dimensional vector) you perform a nearest-neighbor lookup in the codebook, build a new 8x8x64 tensor from the matched codebook entries, and pass it on to the decoder. For CIFAR10, where you have 10 codebooks, the encoder's output is a tensor of size 10x8x8x64. Iterating over the first dimension, for each of the 64 (8x8) latents in a slice you perform the nearest-neighbor lookup against that slice's own codebook. This way, every spatial location can pack more information.
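The per-codebook lookup described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code or the paper's implementation: the function name `quantize_multi_codebook`, the codebook size `K = 32`, and the random inputs are all assumptions made for the example, and NumPy stands in for PyTorch only to keep the sketch short.

```python
import numpy as np


def quantize_multi_codebook(z, codebooks):
    """Nearest-neighbor quantization with one codebook per slice.

    z:         encoder output, shape (G, H, W, D), e.g. (10, 8, 8, 64)
    codebooks: shape (G, K, D), one K-entry codebook per slice (hypothetical layout)
    Returns the quantized tensor (G, H, W, D) and the chosen indices (G, H, W).
    """
    G, H, W, D = z.shape
    flat = z.reshape(G, H * W, D)  # (G, N, D) with N = H * W latents per slice

    # Squared Euclidean distance from every latent to every entry of the
    # codebook belonging to the same slice: ||a||^2 - 2 a.b + ||b||^2
    dist = (
        (flat ** 2).sum(-1, keepdims=True)            # (G, N, 1)
        - 2.0 * flat @ codebooks.transpose(0, 2, 1)   # (G, N, K)
        + (codebooks ** 2).sum(-1)[:, None, :]        # (G, 1, K)
    )
    idx = dist.argmin(-1)                             # (G, N)

    # Gather the winning entry from each slice's own codebook.
    quantized = codebooks[np.arange(G)[:, None], idx]  # (G, N, D)
    return quantized.reshape(G, H, W, D), idx.reshape(G, H, W)


# Example with the CIFAR10 shapes from the comment above (random data).
rng = np.random.default_rng(0)
z = rng.normal(size=(10, 8, 8, 64))
codebooks = rng.normal(size=(10, 32, 64))  # K = 32 is an arbitrary choice
quantized, indices = quantize_multi_codebook(z, codebooks)
```

With a single codebook (`G = 1`) this reduces to the ImageNet case described above. The thread does not say how the 10 quantized slices are merged before the decoder, so the sketch stops at the quantized 10x8x8x64 tensor; training details such as the straight-through gradient are also left out.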

pclucas14 commented 5 years ago

I see. Thanks for the explanation!