navervision / CompoDiff

Official Pytorch implementation of "CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion" (TMLR 2024)
https://huggingface.co/navervision/CompoDiff-Aesthetic
Apache License 2.0

How to calculate the distance between composed feature and target image feature? #4

Open SunTongtongtong opened 1 year ago

SunTongtongtong commented 1 year ago

Hello there,

Thanks for publishing this excellent work! I have the following questions:

  1. Based on the README Usage 2, we can obtain the fused feature (from the reference image + modifying text). How do you calculate the distance between the fused feature and the target image feature? Is it based on cosine similarity or Euclidean distance?
  2. Do you use the CLIP embedding for the target image?

Best

SanghyukChun commented 11 months ago
  1. It is cosine similarity, i.e., Euclidean distance after L2 normalization. You can check the details in the code (see the sketch after this list).
  2. Yes. We directly compute the similarity between the edited feature (query image & query text) and the CLIP visual feature of the target image.
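For illustration, here is a minimal sketch of that retrieval step, not the repository's exact code: `rank_targets`, `composed_feature`, and `gallery_features` are hypothetical names, and the features are assumed to already be extracted (the composed query via README Usage 2, the targets via a CLIP image encoder).

```python
import torch
import torch.nn.functional as F


def rank_targets(composed_feature: torch.Tensor,
                 gallery_features: torch.Tensor,
                 top_k: int = 10) -> torch.Tensor:
    """Rank candidate target images by cosine similarity.

    composed_feature: (D,) composed query embedding (reference image + text).
    gallery_features: (N, D) CLIP visual embeddings of candidate target images.
    Returns indices of the top-k most similar gallery images.
    """
    # L2-normalize both sides; the dot product then equals cosine similarity,
    # and ranking by it matches ranking by Euclidean distance after
    # normalization, as noted in the answer above.
    q = F.normalize(composed_feature.unsqueeze(0), dim=-1)  # (1, D)
    g = F.normalize(gallery_features, dim=-1)                # (N, D)
    sims = (q @ g.t()).squeeze(0)                            # (N,)
    return sims.topk(top_k).indices
```

With this setup, retrieval reduces to a single matrix multiplication over the normalized gallery, which is why the choice between cosine similarity and Euclidean distance does not change the ranking once the features are L2-normalized.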