openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

Why does CLIP always need softmax and not simple Cosine Similarity #310

Open evergreenllc2020 opened 1 year ago

evergreenllc2020 commented 1 year ago

I would like to use CLIP embeddings for text and images in Elasticsearch. It turns out that CLIP embeddings always seem to need at least two text inputs for every image, with a softmax applied over them. Is there a way to generate embeddings so that I can directly use simple cosine similarity between one text input and one image input? Doing the softmax in Elasticsearch over two text inputs and one image embedding at runtime is complicated and expensive. For other embeddings, like BERT, we can use cosine similarity directly.

Here is sample code. How can I avoid softmax at runtime and just use one text input per image?

with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    print(image_features.shape)
    print(text_features.shape)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()
    print(similarity)
    similarity = np.exp(similarity) / np.sum(np.exp(similarity), axis=0)

Rijgersberg commented 1 year ago

No need to do the softmax unless you want to do classification. You can do comparisons by computing the cosine similarity between image_features and text_features directly.
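For example, a minimal sketch (assuming the standard openai/clip package, with a placeholder image file "cat.jpg" and a placeholder caption):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One image and one caption -- no second text input, no softmax.
image_input = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text_input = clip.tokenize(["a photo of a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)  # shape (1, 512) for ViT-B/32
    text_features = model.encode_text(text_input)     # shape (1, 512)

# Cosine similarity in [-1, 1]; this score can be compared or indexed directly.
similarity = torch.nn.functional.cosine_similarity(image_features, text_features)
print(similarity.item())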

fractaldna22 commented 1 year ago

You don't need to do

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes the inputs, by the nature of cosines. Normalizing via x = x / x.norm() and then running the result through cosine_similarity would mean dividing by the norm twice, which would seriously degrade the data and damage the accuracy.

fractaldna22 commented 1 year ago

They only do x /= x.norm() when they use

similarity = (100.0 * image_features @ text_features.T)[0].cpu().numpy()

since this isn't actually cosine similarity, it's x @ y.T, so x and y need to be divided by their norms first. But if you use

similarity = (torch.cosine_similarity(image_features, text_features).view(-1, image_features.shape[0]).T.mean(1)).mean(0, True)

I'm pretty sure you'd get the similarity for one image and one text, and you can even use that as a loss in a backward pass if you first multiply the similarity by -1.
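For instance, a minimal sketch of that loss idea (random stand-in tensors in place of real CLIP features):

import torch

# Stand-ins for encode_image / encode_text outputs (hypothetical shapes).
image_features = torch.randn(1, 512, requires_grad=True)
text_features = torch.randn(1, 512)

# Maximize similarity by minimizing its negative.
loss = -torch.cosine_similarity(image_features, text_features).mean()
loss.backward()  # gradients now flow back into image_features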

Externalhappy commented 1 year ago

So, if we calculate the similarity using similarity = F.cosine_similarity(x, y) without normalizing the image and text features, the computed cosine similarity will have the same output as

x = x / x.norm(dim=-1, keepdim=True) 
y = y / y.norm(dim=-1, keepdim=True)
similarity = x @ y.T

Is that correct?
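One way to check is a quick numerical comparison with random stand-in features, e.g.:

import torch
import torch.nn.functional as F

x = torch.randn(4, 512)  # stand-in image features
y = torch.randn(4, 512)  # stand-in text features

sim_direct = F.cosine_similarity(x, y)  # pairwise, shape (4,)

x_n = x / x.norm(dim=-1, keepdim=True)
y_n = y / y.norm(dim=-1, keepdim=True)
sim_matmul = (x_n @ y_n.T).diagonal()   # matching pairs from the full matrix

print(torch.allclose(sim_direct, sim_matmul, atol=1e-6))  # expected: True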

Dinosaurcubs commented 6 months ago

You don't need to do

image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

if you're using cosine_similarity. torch.cosine_similarity(x, y) already normalizes the inputs, by the nature of cosines. Normalizing via x = x / x.norm() and then running the result through cosine_similarity would mean dividing by the norm twice, which would seriously degrade the data and damage the accuracy.

If I just want to use the visual encoder to get output visual features for downstream tasks, is it necessary to add the image_features /= image_features.norm(dim=-1, keepdim=True) step?