mlfoundations / open_clip

An open source implementation of CLIP.

How to extract 1024 width patch embeddings and CLS embedding #844

Open alvaro-stylesage opened 3 months ago

alvaro-stylesage commented 3 months ago

Hello, I have seen that the encode_image, _encode_image, and forward methods all return img_latents and img_embeds with 768 dimensions, i.e. after the last projection layer. However, in the /open_clip/model_configs/coca_ViT-L-14.json file you specify that the width of the vision encoder is 1024. I have two concerns:

  1. Why is the img_embeds size (1, 255, 768) for one image if there should be 256 patches?
  2. How can I get the raw 1024-dimensional embeddings from the vision encoder, before the projection (see the sketch below)?
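
For context, here is a minimal sketch of what I am trying with a plain PyTorch forward hook. The `model.visual.transformer` attribute, the pretrained tag, and the image path are my assumptions/placeholders, not something I found in the docs, so the exact names may differ between open_clip versions:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="laion2b_s13b_b90k"
)
model.eval()

captured = {}

def save_tokens(module, inputs, output):
    # Output of the ViT transformer stack: per-token embeddings at width 1024
    # (including the CLS token), before pooling and the final projection.
    captured["tokens"] = output

# Assumes the vision tower exposes its transformer stack as
# `model.visual.transformer`; internal attribute names may vary by version.
handle = model.visual.transformer.register_forward_hook(save_tokens)

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
with torch.no_grad():
    model.encode_image(image)
handle.remove()

tokens = captured["tokens"]
# Depending on the open_clip version the layout may be (seq_len, batch, 1024)
# or (batch, seq_len, 1024); permute if necessary.
print(tokens.shape)
```

Is a hook like this the intended way to get the pre-projection tokens, or is there a supported API for it?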

Thanks!

rwightman commented 2 months ago

@alvaro-stylesage the CoCa embeds are a bit off; see https://github.com/mlfoundations/open_clip/issues/458#issuecomment-1457281651

It 'works', but it's not 100% correct.