I'm experimenting with Fashion CLIP and noticed that my zero-shot classification scores were lower when using the built-in zero_shot_classification(images, text_labels) method than when I computed the embeddings, then the similarities, and finally the predictions step by step.
What I've found is that in the _cosine_similarity(key_vectors, space_vectors, normalize) method, only the key_vectors (corresponding to the image embeddings) are normalized. The result is therefore not a true cosine similarity, which requires both sets of vectors to be scaled to unit length, and text labels whose embeddings have a larger norm get inflated scores, which degrades performance.
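To illustrate the effect, here is a minimal NumPy sketch. It is my own toy reimplementation, not the library's actual code: the function names cosine_similarity and key_only_similarity and the 2-D vectors are made up for the example, and I'm assuming embeddings are stored one per row.

```python
import numpy as np


def cosine_similarity(key_vectors, space_vectors):
    """Cosine similarity with BOTH inputs scaled to unit length (illustrative)."""
    key = np.asarray(key_vectors, dtype=float)
    space = np.asarray(space_vectors, dtype=float)
    key = key / np.linalg.norm(key, axis=1, keepdims=True)
    space = space / np.linalg.norm(space, axis=1, keepdims=True)
    return key @ space.T


def key_only_similarity(key_vectors, space_vectors):
    """Mimics the behaviour described above: only key_vectors are normalized."""
    key = np.asarray(key_vectors, dtype=float)
    space = np.asarray(space_vectors, dtype=float)
    key = key / np.linalg.norm(key, axis=1, keepdims=True)
    return key @ space.T  # scores still scale with the norm of each space vector


# Toy data (made up): one "image" embedding and two "text" embeddings.
# The second text vector points in a worse direction but has a larger norm.
image_embeddings = np.array([[1.0, 0.0]])
text_embeddings = np.array([[1.0, 0.0], [3.0, 3.0]])

full = cosine_similarity(image_embeddings, text_embeddings)
partial = key_only_similarity(image_embeddings, text_embeddings)

print(full)     # approximately [[1.0, 0.7071]]
print(partial)  # [[1.0, 3.0]]
print(full.argmax(axis=1), partial.argmax(axis=1))  # [0] vs [1]
```

In this toy case the predicted label flips from the first to the second text vector purely because of its larger norm, which is the kind of score distortion I think I'm seeing.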