Thanks for publishing this excellent work! I have the following questions:
Based on usage 2 in the README, we can obtain the fused feature (from the reference image + modified text). How do you calculate the distance between the fused feature and the target image feature? Is it based on cosine similarity or Euclidean distance?
Do we use the CLIP embedding for the target image?
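
To make the first question concrete, here is a minimal sketch of the two options I have in mind. It is plain Python with random toy vectors standing in for the fused feature and the candidate target image features; the names `fused` and `targets` are my own placeholders, not anything taken from the repository. If the features are L2-normalized, the two metrics should give the same ranking, since ||a - b||^2 = 2 - 2 cos(a, b), so the choice would only matter for unnormalized features.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (higher means closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (lower means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


random.seed(0)
dim = 8  # toy dimension, chosen only for illustration

# Hypothetical stand-ins: one fused query feature and five candidate targets.
fused = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
targets = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(5)]

# Rank candidates from best to worst under each metric.
rank_by_cosine = sorted(
    range(len(targets)), key=lambda i: -cosine_similarity(fused, targets[i])
)
rank_by_euclid = sorted(
    range(len(targets)), key=lambda i: euclidean_distance(fused, targets[i])
)

print("cosine ranking:   ", rank_by_cosine)
print("euclidean ranking:", rank_by_euclid)
print("same ranking:", rank_by_cosine == rank_by_euclid)
```

So what I am really asking is whether the features are normalized before the comparison and which of the two metrics is applied.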
Best