Hello, I have seen that the `encode_image`, `_encode_image` and `forward` methods all return `img_latents` and `img_embeds` with dimension 768, i.e. after the last projection layer. However, the `/open_clip/model_configs/coca_ViT-L-14.json` file specifies that the width of the vision encoder is 1024. I have 2 questions:
1. Why is the `img_embeds` size (1, 255, 768) for one image if there should be 256 patches?
2. How can I get the raw embeddings of size 1024, taken directly after the vision encoder?
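
For reference, this is the arithmetic behind the 256 figure in the first question. It is only a minimal sketch: the 224x224 input resolution and the 14x14 patch size are assumptions (the patch size is read off the model name, the resolution is not stated above), and `num_patches` is a hypothetical helper, not part of open_clip.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumed values, not confirmed in this post.
IMAGE_SIZE = 224  # assumed input resolution
PATCH_SIZE = 14   # assumed from the "ViT-L-14" model name

expected_tokens = num_patches(IMAGE_SIZE, PATCH_SIZE)  # 16 * 16 patches
observed_tokens = 255  # second dimension of img_embeds reported above

print(f"expected patches: {expected_tokens}")
print(f"observed tokens:  {observed_tokens}")
print(f"difference:       {expected_tokens - observed_tokens}")
```

Under these assumptions the output is exactly one token short of the patch count, which is the discrepancy I am asking about.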
Thanks!