smith-co opened this issue 2 years ago
> Let's say you have 10 different images and you want to check which ones are duplicates. Just use `model.encode_image(img)` to encode the 10 images; you will get a tensor of size 10 x 512. Then check the cosine similarity between the vectors. Put a threshold, maybe 0.95, so that if two images have cosine similarity between embeddings > threshold, label them as duplicate / near duplicate depending on your definition.
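A minimal sketch of that recipe (assumptions: the `openai/clip` package, placeholder file names, and the 0.95 threshold suggested above):

```python
import itertools
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["1.jpg", "2.jpg", "3.jpg"]  # placeholder file names
images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)

with torch.no_grad():
    emb = model.encode_image(images).float()      # [N, 512]
    emb = emb / emb.norm(dim=-1, keepdim=True)    # unit-normalize
    sim = emb @ emb.T                             # pairwise cosine similarity [N, N]

threshold = 0.95
for i, j in itertools.combinations(range(len(paths)), 2):
    score = sim[i, j].item()
    if score > threshold:
        print(f"{paths[i]} and {paths[j]} look like near duplicates (sim={score:.3f})")
```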
Computing `model.encode_image(img)` and then using `cosine_similarity` worked remarkably well for me for general images. However, for the specific image dataset I am trying, the above approach is not working out so well, so it appears I have to fine-tune the model following https://github.com/openai/CLIP/issues/83.

But for CLIP fine-tuning I would always need (image, text) pairs. Or can I fine-tune differently for near-duplicate detection?
If you are trying to create specific duplicate detection on a dataset where the definition of duplicate does not follow the general definition, then I think it is possible to train the model for that specific context. But you need a dataset of pairs of images labeled with whether they are duplicates or not, and you need to customize the training code a bit. The format of the data would be like this:
img1 | img2 | is_duplicate |
---|---|---|
1.jpg | 2.jpg | 1 |
3.jpg | 4.jpg | 1 |
5.jpg | 6.jpg | 0 |
The changes you need to make are:

- Instead of using `model(img, text)`, you need to manually create the cosine similarity matrix using `model.encode_image(img1)` and `model.encode_image(img2)`.
- Change the ground truth from `torch.arange` to follow the labels you have in your dataset.

Or try to use a larger model and see whether a larger model can create better embeddings for your need.
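A minimal sketch of what such a modified training step could look like (assumptions: the `openai/clip` package, batches of preprocessed image pairs, and a float `is_duplicate` label per pair; the simple MSE-on-cosine-similarity loss is just one option, a contrastive/margin loss would also work, and the mixed-precision caveats discussed in issue #83 still apply):

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

def training_step(img1, img2, is_duplicate):
    # img1, img2: preprocessed image batches [B, 3, 224, 224]
    # is_duplicate: float tensor [B], 1.0 for duplicate pairs, 0.0 otherwise
    emb1 = model.encode_image(img1.to(device)).float()
    emb2 = model.encode_image(img2.to(device)).float()
    sim = F.cosine_similarity(emb1, emb2)  # [B], values in [-1, 1]
    # Ground truth comes from your own labels instead of torch.arange:
    # push similarity toward 1 for duplicates and toward 0 otherwise.
    loss = F.mse_loss(sim, is_duplicate.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```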
@vinson2233 you are absolutely right: I am trying to create specific duplicate detection on a dataset.
> The changes you need to make are:
>
> - Instead of using `model(img, text)`, you need to manually create the cosine similarity matrix using `model.encode_image(img1)` and `model.encode_image(img2)`.
> - Change the ground truth from `torch.arange` to follow the labels you have in your dataset.
I have been looking into the steps listed here: https://github.com/openai/CLIP/issues/83, but I have not really followed what changes you are referring to here.
Alternatively, I suppose I could also create (image, text) pairs like this:
image | text |
---|---|
1.jpg | tag1 |
2.jpg | tag1 |
3.jpg | tag1 |
4.jpg | tag1 |
5.jpg | tag2 |
5.jpg | tag3 |
As `1.jpg`, `2.jpg`, `3.jpg`, `4.jpg` are near duplicates, I am thinking of using the same tag, i.e. `tag1`. I presume this should work as well?
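A small sketch of what such a dataset could look like in code (assumptions: the `openai/clip` package and the placeholder file names/tags from the table above):

```python
import clip
from PIL import Image
from torch.utils.data import Dataset

class TaggedImageDataset(Dataset):
    """Yields (preprocessed image, tokenized tag) pairs for CLIP fine-tuning."""

    def __init__(self, pairs, preprocess):
        # pairs: list of (image_path, tag) tuples, e.g. [("1.jpg", "tag1"), ("5.jpg", "tag2")]
        self.pairs = pairs
        self.preprocess = preprocess

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, tag = self.pairs[idx]
        image = self.preprocess(Image.open(path).convert("RGB"))
        text = clip.tokenize([tag])[0]  # token tensor of length 77
        return image, text
```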
> Or try to use a larger model and see whether a larger model can create better embeddings for your need.
What model are you referring to here?
> What model are you referring to here?
The encoder variations: https://github.com/openai/CLIP/blob/b46f5ac7587d2e1862f8b7b1573179d80dcdd620/clip/clip.py#L30-L40
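For example, swapping in a bigger image encoder only changes the name passed to `clip.load`; the available names depend on the installed version, and `"RN50x4"` below is used purely as an example:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Lists the encoder variations shipped with your clip install,
# e.g. ['RN50', 'RN101', 'RN50x4', 'ViT-B/32', ...]
print(clip.available_models())

# Pick a larger encoder from that list and load it as usual.
model, preprocess = clip.load("RN50x4", device=device)
```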
> As `1.jpg`, `2.jpg`, `3.jpg`, `4.jpg` are near duplicates, I am thinking of using the same tag, i.e. `tag1`. I presume this should work as well?
Sure, this is also possible; you are trying to catalog similar images under one category. But you need to modify the loss calculation to only compute the loss along the image axis, not the text axis, because now one text can match multiple images. Another alternative is to make sure each batch contains only one member from each tag; that way you can calculate the loss from both axes. I never tried this, so good luck.
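A minimal sketch of that image-axis-only loss (assumptions: the usual fine-tuning setup from issue #83, with batches of preprocessed images and `clip.tokenize`'d tags; the fp16 caveats from that issue still apply):

```python
import clip
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

loss_img = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)

def training_step(images, texts):
    # images: [B, 3, 224, 224] preprocessed batch, texts: [B, 77] from clip.tokenize
    logits_per_image, logits_per_text = model(images.to(device), texts.to(device))
    ground_truth = torch.arange(images.shape[0], device=device)
    # Keep only the image->text term; the text->image term is dropped because
    # one tag may now match several images in the batch.
    loss = loss_img(logits_per_image, ground_truth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```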
Given a pair of images, my use case is to detect whether they are duplicates or not: (imageX, imageY) -> verdict/score, where verdict = duplicate / not duplicate / near duplicate.

How can I use CLIP for this use case?