While reviewing the source code of CLIP, I noticed that the output of `model(img, text)` is a pair of tensors with the same values. Looking at `clip/model.py`, the forward pass returns `logits_per_image` and `logits_per_text`, where the latter is just the transpose of the former. Moreover, `logits_per_image` is already the (scaled) matrix product of the image features and the text features, so the second tensor carries no extra information.
I'd appreciate it if someone could clarify why both are returned.
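For reference, here is a minimal NumPy sketch of what I understand the forward pass in `clip/model.py` to compute. The feature matrices below are random placeholders standing in for the image/text encoder outputs, and the shapes are chosen only for illustration:

```python
import numpy as np

# Hypothetical encoder outputs: 2 images and 3 text prompts, embedding dim 4.
rng = np.random.default_rng(0)
image_features = rng.standard_normal((2, 4))
text_features = rng.standard_normal((3, 4))

# As in clip/model.py: L2-normalize the features, then take a scaled dot product.
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# CLIP initializes logit_scale to log(1/0.07) and exponentiates it in forward.
logit_scale = 1 / 0.07
logits_per_image = logit_scale * image_features @ text_features.T  # shape (2, 3)
logits_per_text = logits_per_image.T                               # shape (3, 2)
```

So `logits_per_text` really is just `logits_per_image.T`; the same similarity values, indexed from the text side instead of the image side.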
Thanks for your contribution!