openai / CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
MIT License

How to change the [1, 512] features obtained by the CLIP encoder #441

Open lwtgithublwt opened 5 months ago

lwtgithublwt commented 5 months ago

What exactly does the [1, 512] feature produced by the CLIP encoder represent, and how can it be turned into a feature map with channel, height, and width dimensions?
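For context, a minimal sketch of the shapes involved, under assumptions not stated in the question: a ViT-B/32 model with 224x224 input, where [1, 512] would be a batch of one 512-dimensional global image embedding with no spatial axes left. Two possible routes to a (channels, height, width) layout are shown with plain Python lists standing in for tensors: tiling the global vector over a grid, or rearranging per-patch tokens taken before pooling (768-wide, 7x7 of them under these assumptions). The helper names `tile_embedding`, `tokens_to_feature_map`, and `shape` are hypothetical and not part of the CLIP package; as far as I know `encode_image` returns only the pooled embedding, so real patch tokens would have to be pulled from inside the visual transformer.

```python
def tile_embedding(embedding, height, width):
    """Repeat a C-dim global vector at every location -> nested list [C][H][W]."""
    return [[[value] * width for _ in range(height)] for value in embedding]


def tokens_to_feature_map(tokens, height, width):
    """Rearrange H*W patch tokens (each C-dim, row-major order) into [C][H][W]."""
    if len(tokens) != height * width:
        raise ValueError("expected height * width tokens")
    channels = len(tokens[0])
    return [
        [[tokens[row * width + col][c] for col in range(width)]
         for row in range(height)]
        for c in range(channels)
    ]


def shape(nested):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Stand-in for the [1, 512] output: a batch of one 512-dim global embedding.
features = [[0.0] * 512]
tiled = tile_embedding(features[0], 7, 7)

# Stand-in for per-patch tokens (assumed ViT-B/32): 224 // 32 = 7 patches per
# side, so 7 * 7 = 49 tokens, each 768 wide before the final projection.
grid = 224 // 32
patch_tokens = [[float(i)] * 768 for i in range(grid * grid)]
feature_map = tokens_to_feature_map(patch_tokens, grid, grid)

print(shape(tiled))        # (512, 7, 7)
print(shape(feature_map))  # (768, 7, 7)
```

Tiling adds no spatial information, since every location holds the same vector; only the token-based route keeps per-patch detail, at the cost of working with features that have not passed through CLIP's final projection.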