navervision / CompoDiff

Official Pytorch implementation of "CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion" (TMLR 2024)
https://huggingface.co/navervision/CompoDiff-Aesthetic
Apache License 2.0

How to calculate the distance between composed feature and target image feature? #4

Open SunTongtongtong opened 1 year ago

SunTongtongtong commented 1 year ago

Hello there,

Thanks for publishing this excellent work! I have the following questions:

  1. Based on the README Usage 2, we can obtain the fused feature (from the reference image + modifying text). How do you calculate the distance between the fused feature and the target image feature? Is it based on cosine similarity or Euclidean distance?
  2. Do you use the CLIP embedding for the target image?

Best

SanghyukChun commented 11 months ago
  1. It is cosine similarity, i.e., Euclidean distance after L2 normalization. You can check the details in the code (see the sketch after this list).
  2. Yes. We directly compute the similarity between the edited feature (query image & query text) and the CLIP visual feature of the target image.
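For illustration, here is a minimal sketch of that retrieval step, not the repository's exact code: `rank_targets`, `composed_feature`, and `gallery_features` are hypothetical names, and the features are assumed to already be extracted (the composed query via README Usage 2, the targets via a CLIP image encoder).

```python
import torch
import torch.nn.functional as F


def rank_targets(composed_feature: torch.Tensor,
                 gallery_features: torch.Tensor,
                 top_k: int = 10) -> torch.Tensor:
    """Rank candidate target images by cosine similarity.

    composed_feature: (D,) composed query embedding (reference image + text).
    gallery_features: (N, D) CLIP visual embeddings of candidate target images.
    Returns indices of the top-k most similar gallery images.
    """
    # L2-normalize both sides; the dot product then equals cosine similarity,
    # and ranking by it matches ranking by Euclidean distance after
    # normalization, as noted in the answer above.
    q = F.normalize(composed_feature.unsqueeze(0), dim=-1)  # (1, D)
    g = F.normalize(gallery_features, dim=-1)                # (N, D)
    sims = (q @ g.t()).squeeze(0)                            # (N,)
    return sims.topk(top_k).indices
```

With this setup, retrieval reduces to a single matrix multiplication over the normalized gallery, which is why the choice between cosine similarity and Euclidean distance does not change the ranking once the features are L2-normalized.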