evergreenllc2020 opened this issue 1 year ago
No need to do the softmax unless you want to do classification. You can do comparisons by computing the cosine similarity between image_features and text_features directly.
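For example, here is a minimal sketch of that direct comparison (assuming the standard clip package; the file name and prompt are placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "example.jpg" and the prompt below are placeholders.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# One image vs. one text: a single cosine score in [-1, 1], no softmax needed.
score = torch.cosine_similarity(image_features, text_features)
print(score.item())
```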
You don't need to do

```python
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
```

if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes the inputs, by the nature of cosines. Normalizing via x = x / x.norm() and then running the result through cosine_similarity would mean dividing by the norm twice. That would seriously degrade the data going through and damage the accuracy.
They only do x /= x.norm() when they use

```python
similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
```

since that isn't actually cosine similarity, it's x @ y.T, so x and y need to be divided by their norms first. But if you use

```python
similarity = (torch.cosine_similarity(image_features, text_features).view(-1, image_features.shape[0]).T.mean(1)).mean(0, True)
```

I'm pretty sure you'd get the similarity for one image and one text, and you can even use that as a loss in a backward pass if you first multiply that similarity by -1.
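A short sketch of that last point, with random stand-ins for the CLIP features (illustrative only):

```python
import torch
import torch.nn.functional as F

# Random stand-ins for CLIP outputs; in a real training loop these would
# come from model.encode_image / model.encode_text without torch.no_grad().
image_features = torch.randn(8, 512, requires_grad=True)
text_features = torch.randn(8, 512, requires_grad=True)

# Per-pair cosine similarity: row i of the images vs. row i of the texts.
similarity = F.cosine_similarity(image_features, text_features, dim=-1)  # (8,)

# Multiplying by -1 turns "maximize similarity" into a loss to minimize.
loss = (-similarity).mean()
loss.backward()
```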
So, if we calculate the similarity using

```python
similarity = F.cosine_similarity(x, y)
```

without normalizing the image and text features, the computed cosine similarity will have the same output as

```python
x = x / x.norm(dim=-1, keepdim=True)
y = y / y.norm(dim=-1, keepdim=True)
similarity = x @ y.T
```
Is that correct?
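A quick numerical check of that equivalence, using random stand-in tensors; note that F.cosine_similarity returns the pair-wise values, i.e. the diagonal of the full x @ y.T matrix rather than the whole matrix:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 512)   # stand-in for image_features
y = torch.randn(4, 512)   # stand-in for text_features

# Cosine similarity on the raw, unnormalized features...
a = F.cosine_similarity(x, y, dim=-1)            # shape: (4,)

# ...versus explicit normalization followed by a dot product.
xn = x / x.norm(dim=-1, keepdim=True)
yn = y / y.norm(dim=-1, keepdim=True)
b = (xn * yn).sum(dim=-1)                        # the diagonal of xn @ yn.T

print(torch.allclose(a, b, atol=1e-5))           # True, up to float tolerance
```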
> You don't need to do
>
> ```python
> image_features /= image_features.norm(dim=-1, keepdim=True)
> text_features /= text_features.norm(dim=-1, keepdim=True)
> ```
>
> if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes the inputs, by the nature of cosines. Normalizing via x = x / x.norm() and then running the result through cosine_similarity would mean dividing by the norm twice. That would seriously degrade the data going through and damage the accuracy.
If I just want to use the visual encoder to get the output visual feature for downstream tasks, is it necessary to add the 'image_features /= image_features.norm(dim=-1, keepdim=True)' line?
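For reference, a minimal sketch of that downstream-feature case (assuming model and image_input are already set up as in the other snippets in this thread; whether to keep the normalization depends on the downstream task):

```python
import torch

# model and image_input are assumed to be set up as in the other snippets.
with torch.no_grad():
    image_features = model.encode_image(image_input)   # raw visual features

# Optional: keep this line only if the downstream task compares features
# with a dot product / cosine similarity; otherwise the raw features can
# be used directly.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
```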
I would like to use CLIP embeddings for text and images in Elasticsearch. It turns out that CLIP embeddings always need at least 2 text inputs for every image, and a softmax is applied over them. Is there a way to regenerate embeddings so that I can directly use simple cosine similarity between one text input and one image input? Doing the softmax in Elasticsearch for 2 text inputs and one image embedding at runtime is complicated and expensive. For other embeddings, like BERT, we can directly use cosine similarity.
Here is sample code. How can I avoid softmax at runtime and just use one text input per image?
```python
import numpy as np
import torch

# model, image_input, and text_inputs are assumed to be set up beforehand
# (e.g. via clip.load, preprocess, and clip.tokenize).
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# L2-normalize both sets of features.
image_features /= image_features.norm(dim=-1, keepdim=True)
print(image_features.shape)
print(text_features.shape)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Scaled dot products of the first image against every text...
similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
print(similarity)

# ...followed by a softmax over the text candidates.
similarity = np.exp(similarity) / np.sum(np.exp(similarity), axis=0)
```
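One way to sidestep the softmax, sketched under the assumption that a single text is compared against a single image (the variable names follow the snippet above; this is illustrative, not tested against any particular Elasticsearch setup): store L2-normalized embeddings, and a plain dot product at query time is then already the cosine similarity.

```python
import torch

# model, image_input, and text_inputs come from the snippet above;
# here only the first tokenized text is used.
with torch.no_grad():
    image_features = model.encode_image(image_input)        # (1, d)
    text_features = model.encode_text(text_inputs[:1])      # (1, d)

# Normalize once (e.g. at indexing time) and store these vectors.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# A plain dot product between unit vectors is the cosine similarity,
# a single score in [-1, 1] -- no softmax and no second text input needed.
score = (image_features @ text_features.T).item()
print(score)
```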