After going through the code, I have a few questions.
In the first stage, when training the discrete VAE, we already learn a codebook. Why isn't it reused for second-stage training, instead of initializing a new codebook for the image tokens?
During training, the original image is used as input. During inference, what should the image input be? Is it random noise of size 3x256x256? And how is causal attention applied in the transformer at inference time?
Thanks for this great work!