Closed xiaotingxuan closed 1 year ago
+1
Thank you for your interest. If we used only CLIP as the image encoder, without the VAE, I believe image generation quality would suffer. This view is motivated by the observation that the zero-shot text-to-image FID of DALL-E 2 is not as good as that of Stable Diffusion. Conversely, compared to using only the VAE as the image encoder, adding CLIP as an auxiliary encoder improves the quality of image captions. We also found that, as image encoders, ViT-B/32 and ViT-L/14 give similar image caption quality, so we opted for the smaller ViT-B/32. For the text encoder, we used ViT-L/14, following UViT. Moreover, Imagen demonstrated that a larger text encoder results in better image generation quality.
I notice that you use two image encoders, and you say that $x_0^{AE}$ is sufficient for image reconstruction, while $x_0^{CLIP}$ helps capture the semantics of images. I am curious what the effect would be if you used only one of them. Also, you use CLIP ViT-B/32 as the image encoder, but instead of using its matching text encoder, you use the CLIP ViT-L/14 text encoder from Stable Diffusion. Would results improve if the image encoder and text encoder came from the same CLIP model?