How to change the [1, 512] features obtained by the CLIP encoder

openai / CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

MIT License


Open lwtgithublwt opened 5 months ago

lwtgithublwt commented 5 months ago

What exactly does the [1, 512] feature obtained by the CLIP encoder represent, and how can it be turned into a feature map with channel, height, and width dimensions?
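
To make the shapes in the question concrete, here is a minimal sketch. It assumes the ViT-B/32 image encoder at 224x224 input, which the question does not state. Random NumPy arrays stand in for real model activations, and the final layer norm is omitted. In that setup the [1, 512] vector is a single global embedding: the class token projected into the joint image-text space. It has no spatial layout of its own. A channel/height/width map would instead come from the per-patch tokens that precede the projection.

```python
import numpy as np

# Assumed ViT-B/32 dimensions at 224x224 input (not stated in the question):
# 224 / 32 = 7 patches per side, transformer width 768, output embedding 512.
batch, grid, width, embed_dim = 1, 7, 768, 512

rng = np.random.default_rng(0)
# Stand-in for the transformer output: 1 class token + 7*7 patch tokens.
tokens = rng.standard_normal((batch, 1 + grid * grid, width))
# Stand-in for the learned projection into the joint image-text space.
proj = rng.standard_normal((width, embed_dim))

# Global feature: only the class token is kept and projected -> [1, 512].
global_feature = tokens[:, 0, :] @ proj

# Spatial feature map: drop the class token, then fold the 49 patch tokens
# back into their 7x7 grid and move channels first -> [1, 768, 7, 7].
patch_tokens = tokens[:, 1:, :]
feature_map = patch_tokens.reshape(batch, grid, grid, width).transpose(0, 3, 1, 2)

print(global_feature.shape)
print(feature_map.shape)
```

With the real model, the patch tokens would have to be read from inside the visual transformer (for example with a forward hook), since `encode_image` returns only the pooled vector.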