openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Can the text embedding be recovered to text? #428

Open Zhangwenyao1 opened 6 months ago

Zhangwenyao1 commented 6 months ago

Thanks for your excellent work! I want to know whether the text embedding can be recovered back to text.

hamza13-12 commented 5 months ago

I believe you'd have to train your own decoder to make something like that work.
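To make the "train your own decoder" idea concrete, here is a minimal sketch. Everything in it is invented for illustration (the toy encoder, sizes, learning rate): a frozen embedding table stands in for CLIP's text encoder, and a softmax-regression decoder is fit on (embedding, token) pairs to map embeddings back to tokens. A real decoder for 512-d CLIP embeddings would be a trained language model head, but the recipe is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in, purely for illustration: a frozen "text encoder" mapping
# token ids to embeddings. Real CLIP text embeddings are transformer
# outputs (e.g. 512-d); the recipe is the same: collect (embedding, text)
# pairs and fit a decoder on them.
V, D = 6, 8
encoder = rng.normal(size=(V, D))          # frozen embedding table

def embed(token_id):
    # Noisy embedding, standing in for a CLIP text feature.
    return encoder[token_id] + 0.05 * rng.normal(size=D)

# Training set of (embedding, token id) pairs.
X = np.stack([embed(i % V) for i in range(300)])
y = np.arange(300) % V

# Decoder: softmax regression trained by gradient descent on cross-entropy.
W = np.zeros((D, V))
for _ in range(500):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.5 * X.T @ (p - np.eye(V)[y]) / len(X)

# Recover the token behind a fresh embedding of token 3.
pred = int(np.argmax(embed(3) @ W))
print("decoded token id:", pred)
```

With near-noiseless embeddings a linear decoder already inverts the table; for real CLIP embeddings you would swap in a sequence decoder and far more training pairs.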

zer0int commented 3 months ago

If you mean "getting a CLIP opinion about an image", yes, you can do that using gradient ascent. You can optimize for the text that is "most alike" the image features, and get a "stochastic CLIP textual description" of the image.

If you feed an image of a cat, CLIP will certainly conclude "cat", amongst many other things that may be puzzling to humans. For example, it may conclude "map" about your tabby cat's fur pattern. Or construct conjoined long words like "hallucinkaleidodimensional" about something colorful.
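The gradient-ascent idea above can be sketched in a few lines. This is a toy stand-in, not the real pipeline: a tiny invented vocabulary and a fixed embedding table replace CLIP's frozen text encoder, and soft token weights are optimized so the resulting "text embedding" has maximal cosine similarity with a pretend image feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for CLIP: a tiny vocabulary with a fixed embedding table.
# In the real tools, the frozen CLIP text encoder plays this role; names
# and sizes here are invented for illustration.
VOCAB = ["cat", "dog", "map", "car", "tree"]
E = rng.normal(size=(len(VOCAB), 4))       # (vocab, dim) token embeddings

# Pretend image feature: close to the "cat" embedding, plus noise.
image_feat = E[0] + 0.1 * rng.normal(size=4)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Gradient ascent: optimize soft token weights so the resulting text
# embedding is "most alike" the image feature (maximal cosine similarity).
z = np.zeros(len(VOCAB))                   # logits over the vocabulary
lr = 0.5
for _ in range(200):
    p = softmax(z)
    t = p @ E                              # soft "text embedding"
    s = cosine(t, image_feat)
    # Analytic gradient of cosine similarity w.r.t. the soft weights.
    ds_dt = image_feat / (np.linalg.norm(t) * np.linalg.norm(image_feat)) \
            - s * t / (t @ t)
    g = E @ ds_dt                          # d s / d p
    z += lr * p * (g - p @ g)              # chain rule through the softmax

best = VOCAB[int(np.argmax(softmax(z)))]
print("most image-like token:", best)
```

Run against the real model, this kind of optimization is what surfaces both the sensible tokens ("cat") and the stranger ones ("map", conjoined words), since it only cares about feature similarity, not human readability.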

I made an intuitive and interactive GUI with attention visualization (you can see "where CLIP was looking" for a given word): https://github.com/zer0int/CLIP-XAI-GUI

Or, for batch processing via the command line: https://github.com/zer0int/CLIP-text-image-interpretability