While reviewing the source code of CLIP, I noticed that the output of `model(img, text)` is a pair of tensors with the same values. Looking at `clip/model.py`, the forward pass returns `logits_per_image` and `logits_per_text`, where the latter is just the transpose of the former. Moreover, `logits_per_image` is already the (scaled) matrix product of the image features and the text features, so the second tensor carries no extra information.
I'd appreciate it if someone could clarify why both are returned.
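For reference, here is a minimal NumPy sketch of what I understand the forward pass in `clip/model.py` to compute. The feature matrices below are random placeholders standing in for the image/text encoder outputs, and the shapes are chosen only for illustration:

```python
import numpy as np

# Hypothetical encoder outputs: 2 images and 3 text prompts, embedding dim 4.
rng = np.random.default_rng(0)
image_features = rng.standard_normal((2, 4))
text_features = rng.standard_normal((3, 4))

# As in clip/model.py: L2-normalize the features, then take a scaled dot product.
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

# CLIP initializes logit_scale to log(1/0.07) and exponentiates it in forward.
logit_scale = 1 / 0.07
logits_per_image = logit_scale * image_features @ text_features.T  # shape (2, 3)
logits_per_text = logits_per_image.T                               # shape (3, 2)
```

So `logits_per_text` really is just `logits_per_image.T`; the same similarity values, indexed from the text side instead of the image side.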
Thanks for your contribution!