lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
MIT License

What is the image input for inference? #435

Open SenHe opened 2 years ago

SenHe commented 2 years ago

Thanks for this great work!

After going through the code, I have some questions.

  1. In the first stage, training the discrete VAE, we already learn a codebook. Why isn't it reused for second-stage training, instead of initializing a new codebook for images?

  2. During training, the original image is used as input. During inference, what should the image input be? Is it random noise of size 3x256x256? How is causal attention applied in the transformer at inference time?
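
For context on question 2, here is a minimal sketch of how autoregressive inference usually works in DALL-E style models. It is not taken from this repo: `toy_logits` is a hypothetical stand-in for the transformer, and all sizes are made up. The point it illustrates is that no image (and no noise image) is fed in; generation starts from the text tokens alone, and image tokens are sampled one at a time, each conditioned only on what came before.

```python
import math
import random

# All sizes below are hypothetical, chosen only to keep the example small.
TEXT_LEN = 4            # number of text tokens in the prompt
IMAGE_LEN = 6           # number of image tokens to generate
NUM_IMAGE_TOKENS = 8    # size of the discrete VAE vocabulary


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def toy_logits(prefix):
    """Hypothetical stand-in for the transformer.

    Returns logits for the next image token. It depends only on the
    prefix, which is exactly what a causal mask guarantees in a real model.
    """
    seed = sum((i + 1) * t for i, t in enumerate(prefix))
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(NUM_IMAGE_TOKENS)]


def sample(logits, rng):
    """Draw one token index from softmax(logits)."""
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(NUM_IMAGE_TOKENS), weights=weights, k=1)[0]


def generate_image_tokens(text_tokens, seed=0):
    """Start from text tokens only and append sampled image tokens one by one."""
    rng = random.Random(seed)
    sequence = list(text_tokens)
    for _ in range(IMAGE_LEN):
        sequence.append(sample(toy_logits(sequence), rng))
    return sequence[len(text_tokens):]


text = [3, 1, 4, 1]
image_tokens = generate_image_tokens(text)
mask = causal_mask(TEXT_LEN + IMAGE_LEN)
```

In a full pipeline the sampled indices would then be passed to the VAE decoder to produce pixels. During training the same causal mask lets the model see the whole ground-truth sequence at once while still predicting each token from its prefix only.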

kingnobro commented 2 years ago

After reading the code, I also don't understand why there is a new codebook. Have you figured it out since?
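
One plausible reading of the codebook question, offered as an assumption and not as a maintainer's answer: the transformer only consumes the integer indices produced by the VAE encoder, so it can learn its own embedding table at the transformer's width, while the VAE codebook is still what the decoder uses to turn indices back into pixels. The sketch below uses hypothetical names and sizes to show the two tables sharing indices but not vectors.

```python
import random

# Hypothetical sizes; the two tables share the row count but not the width.
NUM_TOKENS = 8      # number of discrete image codes
CODEBOOK_DIM = 4    # width of the VAE codebook vectors
MODEL_DIM = 6       # width of the transformer

rng = random.Random(0)


def make_table(rows, dim):
    """Randomly initialised lookup table, standing in for a learned embedding."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


# Stage 1: learned with the VAE, consumed by the VAE decoder.
vae_codebook = make_table(NUM_TOKENS, CODEBOOK_DIM)

# Stage 2: a separate table learned with the transformer.
image_emb = make_table(NUM_TOKENS, MODEL_DIM)

# The VAE encoder reduces an image to integer indices like these.
image_indices = [5, 0, 7, 2]

# The same indices select different vectors depending on the consumer.
transformer_inputs = [image_emb[i] for i in image_indices]
decoder_inputs = [vae_codebook[i] for i in image_indices]
```

Under this reading, the indices are the only interface between the two stages, which is why the second stage does not need the first stage's vectors.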