Open smith-co opened 2 years ago
Given pairs of (image, text), I need to detect near-duplicates using both features.
Pairs: (image1, text1), (image2, text2), ..., (imageN, textN)
I am thinking of computing an embedding using both image and text features:
```python
multimodal_feature_1 = model(image1, text1, mode='multimodal')[0,0]
multimodal_feature_2 = model(image2, text2, mode='multimodal')[0,0]
matching_score = cosine_similarity(multimodal_feature_1, multimodal_feature_2)
```
Any feedback about this approach?
Also, I would like to know: is there a length limit for `text` that I should be aware of?
Hi, note that the multimodal feature has not been optimized for cosine similarity. The unimodal features can be used to compute cosine similarity because they are trained with the image-text contrastive loss.
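Following that suggestion, here is a minimal sketch of scoring a candidate pair with unimodal features. It assumes the image and text embeddings have already been extracted (e.g. via `mode='image'` / `mode='text'` calls analogous to the snippet above); the function name `pair_similarity` and the equal weighting of the two similarities are illustrative choices, not part of the library's API:

```python
import torch
import torch.nn.functional as F

def pair_similarity(img_feat_a, txt_feat_a, img_feat_b, txt_feat_b, w_image=0.5):
    """Near-duplicate score for two (image, text) pairs from unimodal features.

    Each feature is L2-normalized so the dot product equals cosine similarity.
    w_image weights the image-image term against the text-text term
    (0.5 is an arbitrary default; tune it on validation data).
    """
    img_a = F.normalize(img_feat_a, dim=-1)
    img_b = F.normalize(img_feat_b, dim=-1)
    txt_a = F.normalize(txt_feat_a, dim=-1)
    txt_b = F.normalize(txt_feat_b, dim=-1)
    image_sim = (img_a * img_b).sum(-1)   # cosine similarity of the two images
    text_sim = (txt_a * txt_b).sum(-1)    # cosine similarity of the two texts
    return w_image * image_sim + (1.0 - w_image) * text_sim

# Placeholder features stand in for real unimodal embeddings here.
torch.manual_seed(0)
feat = lambda: torch.randn(256)
score = pair_similarity(feat(), feat(), feat(), feat())
```

Since the score is a convex combination of two cosine similarities, it stays in [-1, 1], and a single threshold on it can flag near-duplicates.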