rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Image embedding significantly different from openai clip library #290

Closed junwang-wish closed 1 year ago

junwang-wish commented 1 year ago

clip-retrieval:

echo 'https://placekitten.com/200/305' >> tmp.txt
# I don't want to resize image
img2dataset --url_list=tmp.txt --output_folder=image_folder --thread_count=64 --image_size=9999999 --resize_only_if_bigger true
# use clip-l-14-336px
clip-retrieval inference --input_dataset image_folder --output_folder embeddings_folder --clip_model ViT-L/14@336px --batch_size 1

I get

> np.load('embeddings_folder/img_emb/img_emb_0.npy')[:,:10]
array([[-0.007797,  0.002708,  0.03284 , -0.013176,  0.04205 ,  0.0029  ,
         0.02792 ,  0.00764 , -0.01304 , -0.03763 ]], dtype=float16)

openai clip library:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14@336px", device=device)
# use just downloaded image
image = preprocess(Image.open("image_folder/00000/000000000.jpg")).unsqueeze(0).to(device)
image_features = model.encode_image(image)

I get

> image_features[:,:10]
tensor([[-0.1623,  0.0564,  0.6835, -0.2743,  0.8753,  0.0604,  0.5812,  0.1591,
         -0.2714, -0.7833]], grad_fn=<SliceBackward0>)

I would understand the numbers being slightly off due to quantization (the clip-retrieval output is float16), but the values here are way off. Is this expected?

junwang-wish commented 1 year ago

Whoops, I just realized that clip-retrieval L2-normalizes the embeddings to unit norm by default, which caused the difference. Is there a way to turn this normalization off?
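A quick way to confirm that the two outputs differ only by a scale factor is to divide them component by component. The sketch below uses only the ten values printed above, so it checks the ratio on that slice rather than on the full embedding; the variable names are illustrative.

```python
import numpy as np

# First 10 components copied from the two outputs above.
clip_retrieval = np.array([-0.007797, 0.002708, 0.03284, -0.013176, 0.04205,
                           0.0029, 0.02792, 0.00764, -0.01304, -0.03763])
openai_clip = np.array([-0.1623, 0.0564, 0.6835, -0.2743, 0.8753,
                        0.0604, 0.5812, 0.1591, -0.2714, -0.7833])

# If clip-retrieval only divides by the L2 norm, every ratio should be
# (nearly) the same positive constant: the norm of the raw embedding.
ratios = openai_clip / clip_retrieval
print(ratios)
print("mean ratio:", ratios.mean())
```

The ratios come out nearly constant, which is consistent with a pure rescaling. On the PyTorch side, dividing by `image_features.norm(dim=-1, keepdim=True)` should bring the openai clip output onto the same scale, up to float16 rounding.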

rom1504 commented 1 year ago

There's no option for it currently but you could add one

Note that if you want to use the embeddings in a knn index, it's better to keep the normalization, so that dot products are cosine similarities
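To illustrate the point about normalization, here is a minimal NumPy sketch showing that the dot product of two unit-norm vectors equals the cosine similarity of the original vectors. The `l2_normalize` helper and the random 768-dimensional vectors are stand-ins for illustration, not clip-retrieval code or real embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale x to unit L2 norm along the given axis (hypothetical helper)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

# Random vectors standing in for two raw (unnormalized) embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Cosine similarity computed from the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Plain dot product after normalizing each vector once.
dot_of_unit = l2_normalize(a) @ l2_normalize(b)

print(cosine, dot_of_unit)
```

Because the normalization is done once when the embeddings are stored, an inner-product knn index can rank by cosine similarity without recomputing norms at query time.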
