openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

I'm confused that logits_per_text is just the transpose of logits_per_image. #387

Open nietzsche9088 opened 1 year ago

nietzsche9088 commented 1 year ago

Thanks for your contribution!

While reviewing the CLIP source code, I found that the output of model(img, text) is a pair of tensors with identical values. Looking at /clip/model.py, the forward pass returns logits_per_image and logits_per_text, and the latter is just the transpose of the former. Moreover, logits_per_image is already the product of the image features and the text features.
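For reference, here is a minimal sketch of the computation I mean, paraphrased rather than copied verbatim from clip/model.py (the batch size, feature dimension, and logit_scale value below are made up for illustration). It also shows why both tensors exist even though one is the transpose of the other: during training, the symmetric contrastive loss takes a row-wise softmax over each, so one term normalizes over texts and the other over images.

```python
import torch
import torch.nn.functional as F

# Dummy, L2-normalized features standing in for the encoder outputs.
# image_features: [batch, dim], text_features: [batch, dim]
batch, dim = 8, 512
image_features = F.normalize(torch.randn(batch, dim), dim=-1)
text_features = F.normalize(torch.randn(batch, dim), dim=-1)
logit_scale = torch.tensor(100.0)  # exp() of the learned temperature; value assumed

# Scaled cosine similarities: row i compares image i against every text.
logits_per_image = logit_scale * image_features @ text_features.t()
# Row i compares text i against every image; same numbers, transposed view.
logits_per_text = logits_per_image.t()

# Symmetric contrastive loss (as in the CLIP paper's pseudocode):
# cross-entropy along the rows of each tensor, so the softmax runs
# over texts in the first term and over images in the second.
labels = torch.arange(batch)
loss = (F.cross_entropy(logits_per_image, labels)
        + F.cross_entropy(logits_per_text, labels)) / 2
```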

I'd appreciate it if someone could clarify this for me.