rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Laion embeddings and clip output for the same image do not match exactly #264

Closed stevebottos closed 1 year ago

stevebottos commented 1 year ago

This issue is related to an older one: https://github.com/rom1504/clip-retrieval/discussions/100, which didn't seem to be resolved entirely. I was able to replicate the results from that discussion (square-pad the images and use the Resizer class prior to the CLIP preprocess), but it's not a perfect solution.

Anyway, the LAION website provides embeddings and parquets which tie an embedding at an index in the array to its associated metadata. In theory the CLIP output and the LAION embedding for the same image should be exactly the same, but they're not. Here's how to reproduce:

from io import BytesIO

import clip
import numpy as np
import pandas as pd
import requests
import torch
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity

# embeddings: https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/img_emb_0.npy
# parquet: https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/metadata/metadata_0.parquet

INDEX = 0

laion_embedding = np.load("img_emb_0.npy")[INDEX]
laion_embedding = np.expand_dims(laion_embedding, 0)
url = pd.read_parquet("metadata_0.parquet")["url"][INDEX]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

response = requests.get(url)
image = Image.open(BytesIO(response.content))
image = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    clip_embedding = model.encode_image(image).cpu().numpy()

print("Pre-norm difference:", (laion_embedding - clip_embedding).sum())
clip_embedding = clip_embedding / np.linalg.norm(clip_embedding)
print("Post-norm difference:", (laion_embedding - clip_embedding).sum())
print("Cosine sim:", cosine_similarity(laion_embedding, clip_embedding))

This outputs:

Pre-norm difference: -4.79
Post-norm difference: 0.1592
Cosine sim: [[0.91325109]]

I've verified that LAION uses the ViT-B/32 backbone as well. I'm wondering what might be causing the discrepancy here. Any ideas? What I'm ultimately looking for is a lightweight addition to pre/post-processing that will let me ensure that embeddings produced for novel queries are computed correctly.

rom1504 commented 1 year ago

What information is missing there https://github.com/rom1504/clip-retrieval/discussions/100#discussioncomment-2023462 ?

stevebottos commented 1 year ago

Thanks for the prompt reply.

Quoting the other user from that discussion:

"I did end up trying to match the process you mention above, by using the Resizer class to first resize to 256 and then using the CLIP transform to downsize to 224. This reduced the norm difference for the example above to about 0.1, which is closer, so maybe the difference is just due to some variation in my image decoder library versions."

I tried this myself and can confirm that the difference between the two embeddings is usually reduced, but after running a test with hundreds of embeddings there are still some that aren't similar enough to call it a solution.

The same user added:

"Regardless, I ended up using your img2dataset library to download most of laion400m myself, and then I am using the clip-retrieval inference script to generate the embeddings myself. Was very helpful to have such code available."

He doesn't mention whether using the tool for download and inference produces an exact match. I'll give it a shot and report back with the results if there's interest / if it's unknown.

rom1504 commented 1 year ago

The answer is simple: if you want to reproduce exactly the embeddings I computed for laion400m, you need to employ exactly the same preprocessing.

That includes the default resizing of img2dataset.

However, there is 1% link rot per month, so some images will be gone and some images will have changed. If you want to discard changed images you may use the hashes we provide.
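For reference, here is a rough, untested sketch of what "the same preprocessing" could look like for a single image: pad to a square and resize to 256 (approximating img2dataset's default "border" resize mode) before handing the result to CLIP's own preprocess. The pad color, placement, and resampling filter are assumptions, and img2dataset also re-encodes images to JPEG during download, so small pixel-level differences can remain.

from io import BytesIO

import clip
import pandas as pd
import requests
import torch
from PIL import Image

def pad_and_resize(img, size=256):
    # Approximate img2dataset's default "border" resize: pad the image to a
    # square, then resize to size x size. Pad color and resampling filter are
    # assumptions; the real downloader uses OpenCV and re-encodes to JPEG.
    img = img.convert("RGB")
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), (255, 255, 255))
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((size, size), Image.BICUBIC)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

url = pd.read_parquet("metadata_0.parquet")["url"][0]  # same metadata file as above
image = Image.open(BytesIO(requests.get(url).content))
image = preprocess(pad_and_resize(image)).unsqueeze(0).to(device)

with torch.no_grad():
    emb = model.encode_image(image)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize before comparing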

rom1504 commented 1 year ago

What is your end goal?

stevebottos commented 1 year ago

Good to know about the link rot; that makes sense. I'll see what I can do, and I'm poking around img2dataset as we speak. The immediate goal is to ensure that novel queries are computed correctly: I'm comparing the known embeddings against the embeddings produced by my preprocess/inference functions as an error check before I trust that novel embeddings are correct.

The end goal is to pull all embeddings into Milvus and use ANN search to grab similar images for novel queries, pretty much what you've done here except entirely offline and using ANN instead of KNN.
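A minimal sketch of that error check, assuming the recomputed embeddings are saved row-aligned with the published ones (the filename my_img_emb_0.npy is hypothetical):

import numpy as np

laion = np.load("img_emb_0.npy").astype(np.float32)
mine = np.load("my_img_emb_0.npy").astype(np.float32)  # hypothetical: recomputed with your own pipeline

# Normalize both sides so a row-wise dot product equals cosine similarity
laion = laion / np.linalg.norm(laion, axis=1, keepdims=True)
mine = mine / np.linalg.norm(mine, axis=1, keepdims=True)

cos = (laion * mine).sum(axis=1)
threshold = 0.95  # arbitrary "close enough" cutoff
print("mean:", cos.mean(), "min:", cos.min())
print("rows below threshold:", int((cos < threshold).sum()), "of", len(cos))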

stevebottos commented 1 year ago

I'll probably end up pulling all the images onto a server somewhere as well to ensure they're always available.

rom1504 commented 1 year ago

Clip-retrieval and Milvus both use approximate KNN, which I guess is what you call ANN. Faiss is still today the best open-source implementation of KNN.
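For example, a small sketch of an approximate KNN (ANN) index over the published image embeddings with Faiss; the IVF parameters below are placeholder assumptions, not the settings used by clip-retrieval:

import faiss
import numpy as np

xb = np.load("img_emb_0.npy").astype(np.float32)
faiss.normalize_L2(xb)  # inner product on unit vectors == cosine similarity

d = xb.shape[1]  # 512 for ViT-B/32
index = faiss.index_factory(d, "IVF4096,Flat", faiss.METRIC_INNER_PRODUCT)
index.train(xb)  # IVF needs a training pass over (a sample of) the vectors
index.add(xb)

index.nprobe = 16  # recall/speed trade-off
scores, ids = index.search(xb[:1], 5)  # query with any normalized CLIP embedding
print(ids, scores)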


stevebottos commented 1 year ago

Got it. I'm just now starting to explore these vector DBs, so the actual methods are still a bit fuzzy. In any case, I'll close this out: using the Resizer class from img2dataset before the CLIP preprocess produces close enough results, as mentioned in the previous issue I linked, although still not perfect matches. I appreciate the responses.