openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License

How to use CLIP for duplicate or near-duplicate images? #260

Open smith-co opened 2 years ago

smith-co commented 2 years ago

Given a pair of images, my use case is to detect whether they are duplicates or not.

(imageX, imageY) → verdict/score
verdict = duplicate / near duplicate / not duplicate

How can I use CLIP for this use case?

vinson2233 commented 2 years ago

Let's say you have 10 different images and want to check which ones are duplicates. Just use model.encode_image(img) to encode the 10 images; you will get a tensor of size 10 x 512. Then compute the cosine similarity between the vectors and apply a threshold, maybe 0.95: if two images have embedding cosine similarity above the threshold, label them as duplicate or near duplicate, depending on your definition.
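A minimal sketch of the thresholding step, using random vectors as stand-ins for the embeddings that model.encode_image would produce (the 0.95 threshold is the suggested starting point, not a tuned value):

```python
import numpy as np

def cosine_sim_matrix(emb):
    # Normalize rows; the dot product of unit vectors is cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return emb @ emb.T

def duplicate_pairs(emb, threshold=0.95):
    # Return index pairs whose embeddings exceed the similarity threshold.
    sim = cosine_sim_matrix(emb)
    n = len(emb)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sim[i, j] > threshold]

# Toy stand-in for CLIP embeddings: rows 0 and 1 are near-identical,
# row 2 is unrelated. Real embeddings come from model.encode_image.
rng = np.random.default_rng(0)
base = rng.normal(size=512)
embeddings = np.stack([base,
                       base + 0.01 * rng.normal(size=512),
                       rng.normal(size=512)])
print(duplicate_pairs(embeddings))  # -> [(0, 1)]
```

With real CLIP embeddings you would stack the outputs of model.encode_image (cast to float and detached) in place of the random vectors.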

smith-co commented 2 years ago

Computing model.encode_image(img) and then using cosine_similarity worked out remarkably well for me for general images.

However, for the specific image dataset I am working with, the above approach is not working out so well. So it appears I have to fine-tune the model following https://github.com/openai/CLIP/issues/83.

But for CLIP fine-tuning I would always need (image, text) pairs. Or, for near-duplicate detection, can I fine-tune differently?

vinson2233 commented 2 years ago

If you are trying to build duplicate detection on a dataset where "duplicate" does not follow the general definition, then I think it is possible to train the model for that specific context. But you need a dataset of image pairs labeled as duplicate or not, and you need to customize the training code a bit. The data format would look like this:

img1   img2   is_duplicate
1.jpg  2.jpg  1
3.jpg  4.jpg  1
5.jpg  6.jpg  0

The changes you need to make are:

  1. Instead of using model(img, text), manually build the cosine similarity matrix from model.encode_image(img1) and model.encode_image(img2).
  2. Change the ground truth from torch.arange to the labels in your dataset.
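The two changes above could be sketched like this: a pairwise loss that scores cosine similarity between the two image embeddings against the 0/1 labels from the dataset, instead of CLIP's usual torch.arange targets. The logit scale of 10.0 is a hypothetical choice standing in for CLIP's learned temperature:

```python
import torch
import torch.nn.functional as F

def pairwise_duplicate_loss(emb1, emb2, is_duplicate):
    # emb1, emb2: (batch, dim) image embeddings, e.g. from
    # model.encode_image(img1) and model.encode_image(img2).
    # is_duplicate: (batch,) float tensor of 0/1 labels from your dataset,
    # replacing the torch.arange targets of the original contrastive loss.
    sim = F.cosine_similarity(emb1, emb2, dim=-1)  # values in [-1, 1]
    # Scale similarity into a logit; 10.0 is an assumed constant playing
    # the role of CLIP's learned temperature.
    logits = sim * 10.0
    return F.binary_cross_entropy_with_logits(logits, is_duplicate)

# Toy batch of 4 pairs with random embeddings in place of CLIP outputs.
emb1 = torch.randn(4, 512)
emb2 = torch.randn(4, 512)
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
loss = pairwise_duplicate_loss(emb1, emb2, labels)
```

You would call loss.backward() on this in place of the original contrastive loss; everything else in the training loop from #83 stays the same.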
vinson2233 commented 2 years ago

Or try a larger model and see whether it produces better embeddings for your needs.

smith-co commented 2 years ago

@vinson2233 you are absolutely right - I am trying to create specific duplicate detection on a dataset.

> The changes you need to make are:
>
>   1. Instead of using model(img, text), manually build the cosine similarity matrix from model.encode_image(img1) and model.encode_image(img2).
>   2. Change the ground truth from torch.arange to the labels in your dataset.

I have been looking into the steps listed in https://github.com/openai/CLIP/issues/83, but have not quite followed what changes you are referring to here.

Alternatively, I suppose I could also create (img, text) pairs like this:

image  text
1.jpg  tag1
2.jpg  tag1
3.jpg  tag1
4.jpg  tag1
5.jpg  tag2
5.jpg  tag3

As 1.jpg, 2.jpg, 3.jpg, and 4.jpg are near duplicates, I am thinking of using the same tag, i.e. tag1. I presume this should work as well?

smith-co commented 2 years ago

> Or try a larger model and see whether it produces better embeddings for your needs.

What model are you referring to here?

vinson2233 commented 2 years ago

> What model are you referring to here?

The encoder variants:
https://github.com/openai/CLIP/blob/b46f5ac7587d2e1862f8b7b1573179d80dcdd620/clip/clip.py#L30-L40

vinson2233 commented 2 years ago

> As 1.jpg, 2.jpg, 3.jpg, and 4.jpg are near duplicates, I am thinking of using the same tag, i.e. tag1. I presume this should work as well?

Sure, this is also possible; you are essentially cataloging similar images under one tag. But you need to modify the loss calculation to compute the loss only along the image axis, not the text axis, because one text can now match multiple images. Another alternative is to make sure each batch contains only one member per tag, so that you can compute the loss along both axes. I have never tried this, so good luck.
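The image-axis-only loss could be sketched like this, assuming logits_per_image is the (n_images, n_texts) similarity matrix CLIP's forward pass returns and labels gives each image's correct tag index (a hypothetical layout for illustration):

```python
import torch
import torch.nn.functional as F

def image_axis_loss(logits_per_image, labels):
    # logits_per_image: (n_images, n_texts) similarity logits from CLIP.
    # labels: (n_images,) index of the correct text/tag for each image.
    # Only the image axis is scored: with shared tags, one text can match
    # several images, so text-axis targets would be ambiguous and the
    # usual symmetric (image + text) loss no longer applies.
    return F.cross_entropy(logits_per_image, labels)

# Toy example: 4 images, 2 distinct tags; images 0-2 all share tag 0,
# mirroring 1.jpg-4.jpg mapping to tag1 above.
logits = torch.randn(4, 2)
labels = torch.tensor([0, 0, 0, 1])
loss = image_axis_loss(logits, labels)
```

This replaces the symmetric (loss_img + loss_txt) / 2 used in the training loop from #83 with the image-side term alone.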