openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License
24.55k stars, 3.2k forks

How to get all the features of the image encoder and text encoder #431

Status: Open · wuyukun-tong opened this issue 5 months ago

wuyukun-tong commented 5 months ago

I want to extract more features from the pretrained CLIP model than just the 512-dimensional CLS-token embedding. How can I modify the model to do that?
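One way to get every token is to re-run each encoder's forward pass by hand and skip the step that keeps a single token. Below is a minimal sketch. It assumes PyTorch and a ViT checkpoint from openai/CLIP (e.g. `ViT-B/32`). The attribute names (`conv1`, `class_embedding`, `ln_pre`, `ln_post`, `proj`, `token_embedding`, `ln_final`, `text_projection`, `dtype`) follow `clip/model.py` as I understand it and may differ between versions. The helper names `encode_image_tokens` and `encode_text_tokens` and the `project` flag are hypothetical. ResNet backbones (`RN50` and similar) pool with an attention layer and are not covered here.

```python
import torch


@torch.no_grad()
def encode_image_tokens(model, image, project=True):
    """Per-token features from a CLIP ViT image encoder (hypothetical helper).

    Re-runs the steps of VisionTransformer.forward from openai/CLIP's
    clip/model.py but keeps every token instead of only the class token.
    Returns [batch, 1 + grid**2, dim]; index 0 is the CLS token.
    """
    visual = model.visual
    x = visual.conv1(image.type(visual.conv1.weight.dtype))  # [N, width, grid, grid]
    x = x.reshape(x.shape[0], x.shape[1], -1).permute(0, 2, 1)  # [N, grid**2, width]
    cls = visual.class_embedding.to(x.dtype) + torch.zeros(
        x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
    )
    x = torch.cat([cls, x], dim=1)  # prepend the CLS token: [N, 1 + grid**2, width]
    x = x + visual.positional_embedding.to(x.dtype)
    x = visual.ln_pre(x)
    # CLIP's transformer expects [length, batch, dim], so permute in and out.
    x = visual.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = visual.ln_post(x)  # the original applies this to x[:, 0] only
    if project and visual.proj is not None:
        x = x @ visual.proj  # [N, 1 + grid**2, embed_dim]
    return x


@torch.no_grad()
def encode_text_tokens(model, text, project=True):
    """Per-token features from CLIP's text encoder (hypothetical helper).

    Mirrors CLIP.encode_text but returns [batch, context_length, dim].
    The original keeps only the end-of-text position, text.argmax(dim=-1);
    positions after it are padding.
    """
    x = model.token_embedding(text).type(model.dtype)  # [N, n_ctx, width]
    x = x + model.positional_embedding.type(model.dtype)
    # In CLIP the causal attention mask is stored inside model.transformer.
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x).type(model.dtype)
    if project:
        x = x @ model.text_projection  # [N, n_ctx, embed_dim]
    return x


# Assumed usage with a real checkpoint (needs the `clip` package and a download):
#   import clip
#   model, preprocess = clip.load("ViT-B/32", device="cpu")
#   image = preprocess(pil_image).unsqueeze(0)
#   img_tokens = encode_image_tokens(model, image)                     # [1, 50, 512]
#   txt_tokens = encode_text_tokens(model, clip.tokenize(["a photo"]))  # [1, 77, 512]
```

Because LayerNorm acts on each token independently, index 0 of the projected image output should equal what `model.encode_image` returns. The projected patch and word tokens were never trained to align with the text embedding, though, so it is safer to treat them as raw features (or pass `project=False`) than as directly comparable CLIP embeddings.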