tobran / GALIP

[CVPR2023] A faster, smaller, and better text-to-image model for large-scale training
MIT License

Some questions... #11

Open CuddleSabe opened 11 months ago

CuddleSabe commented 11 months ago

Hi, I want to do super-resolution by replacing the CLIP text feature with the CLIP image feature. My understanding is that CLIP maps images and text into the same feature space, so I expected this to work. But when I tried it, the model only outputs white images. What is going wrong?

tobran commented 11 months ago

Hello, I think the image feature space and the text feature space are not the same. Although CLIP pulls the two spaces as close together as possible, a gap still remains between them. Moreover, GALIP is trained on text features taken before normalization, so directly replacing the text features with image features is not appropriate. You can modify GALIP's code and retrain a version based on image features.
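
A quick way to see the mismatch described above is to compare the L2 norm of the two feature vectors and the cosine similarity between them before swapping one for the other. The snippet below is only a minimal sketch in plain Python: `compare_features` is a hypothetical helper, not part of GALIP or CLIP, and the two vectors are toy placeholders rather than real CLIP outputs. In practice you would pass in the unnormalized text feature that the generator was trained on and the image feature you plan to substitute.

```python
import math


def l2_norm(v):
    """Euclidean length of a feature vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit length."""
    n = l2_norm(v)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def compare_features(text_feat, image_feat):
    """Hypothetical diagnostic: report the scale of each feature and their angle."""
    return {
        "text_norm": l2_norm(text_feat),
        "image_norm": l2_norm(image_feat),
        "cosine": cosine(text_feat, image_feat),
    }


# Toy placeholder vectors (NOT real CLIP features), chosen only to show
# a scale mismatch and a direction mismatch at the same time.
text_feat = [3.0, 4.0, 0.0]   # stands in for an unnormalized text feature
image_feat = [0.6, 0.0, 0.8]  # stands in for an image feature

stats = compare_features(text_feat, image_feat)
print(stats)
```

If the norms differ a lot, or the cosine similarity is well below 1, the generator is receiving conditioning vectors unlike anything it saw during training. Rescaling the image feature to the text feature's norm would only address the scale difference, not the gap between the two spaces, which is why retraining on image features is the suggested route.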