Zhangwenyao1 opened 9 months ago:

Thanks for your excellent work! I want to know whether the text embedding can be recovered back to text.
I believe you'd have to train your own decoder to make something like that work.
If you mean "getting a CLIP opinion about an image", yes, you can do that using gradient ascent. You can optimize for the text that is "most alike" the image features, and get a "stochastic CLIP textual description" of the image.
If you feed an image of a cat, CLIP will certainly conclude "cat", amongst many other things that may be puzzling to humans. For example, it may conclude "map" about your tabby cat's fur pattern. Or construct conjoined long words like "hallucinkaleidodimensional" about something colorful.
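To make the gradient-ascent idea concrete, here is a minimal sketch against the openai/CLIP pip package. It is not the exact code from the repos linked below: it optimizes a few "soft" token embeddings to maximize cosine similarity with an image's features, then snaps each one to its nearest vocabulary token. The filename `cat.jpg`, the prompt length, learning rate, and step count are all placeholder choices.

```python
import torch
import clip
from PIL import Image
from clip.simple_tokenizer import SimpleTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float().eval()  # keep everything in fp32 for simplicity
for p in model.parameters():
    p.requires_grad_(False)   # only the soft prompt gets gradients

# Frozen target: the image features the "text" should match.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

# Optimize continuous ("soft") token embeddings instead of discrete tokens.
vocab = model.token_embedding.weight.detach()   # [49408, 512] for ViT-B/32
n_tokens = 8                                    # arbitrary prompt length
init_ids = torch.randint(0, 49405, (n_tokens,), device=device)
soft = vocab[init_ids].unsqueeze(0).clone().requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.1)          # lr is a guess; tune it

sot = vocab[49406].view(1, 1, -1)  # start-of-text embedding
eot = vocab[49407].view(1, 1, -1)  # end-of-text embedding

def encode_soft(x):
    """Re-implements model.encode_text, but starting from embeddings."""
    seq = torch.cat([sot, x, eot], dim=1)
    pad = torch.zeros(1, 77 - seq.shape[1], seq.shape[2], device=device)
    h = torch.cat([seq, pad], dim=1) + model.positional_embedding
    h = model.transformer(h.permute(1, 0, 2)).permute(1, 0, 2)
    h = model.ln_final(h)
    feat = h[:, n_tokens + 1] @ model.text_projection  # feature at the EOT position
    return feat / feat.norm(dim=-1, keepdim=True)

for step in range(300):
    opt.zero_grad()
    loss = -(encode_soft(soft) * img_feat).sum()  # maximize cosine similarity
    loss.backward()
    opt.step()

# Snap each optimized embedding to its nearest vocabulary entry and decode.
with torch.no_grad():
    ids = (soft[0] @ vocab.T).argmax(dim=-1)
print(SimpleTokenizer().decode(ids.tolist()))
```

The snap to the nearest token is the lossy step: the optimized embeddings live between vocabulary entries, which is one reason the recovered "text" reads as stochastic fragments rather than grammatical sentences.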
I made an intuitive and interactive GUI with attention visualization (you can see "where CLIP was looking" for a given word): https://github.com/zer0int/CLIP-XAI-GUI
Or, for batch processing via the command-line: https://github.com/zer0int/CLIP-text-image-interpretability