Closed by zhangxin-xd 2 months ago
I noticed the ablation study on the similarity matrix. Could you share some insight into why the learnable one performs better than the post-calculated one?
Thanks.
Hi! That's a good question. Besides the reasons mentioned in the ablation, I think the learnable similarity also helps the distillation of the images and text themselves. This "soft" similarity is a more flexible and precise metric than the identity similarity, so it can guide the image-text alignment more accurately.
(Plus, I'm not a fan of using a pretrained model in DD, as it doesn't seem very fair.)
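To make the "soft" versus identity distinction concrete, here is a minimal sketch, not the repository's implementation. It assumes the similarity matrix acts as the target of a contrastive-style cross-entropy over image-text logits. The function name `alignment_loss`, the toy embeddings, the temperature, and the hand-picked `soft` matrix are all hypothetical. In the learnable setting, the entries of `soft` would be parameters optimized together with the synthetic data rather than fixed by hand.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def alignment_loss(img_emb, txt_emb, target, temperature=0.1):
    # Mean cross-entropy between softmax(image-to-text logits) and the
    # row-normalized target similarity matrix (hypothetical formulation).
    loss = 0.0
    for i, img in enumerate(img_emb):
        logits = [cosine(img, txt) / temperature for txt in txt_emb]
        probs = softmax(logits)
        row_sum = sum(target[i])
        weights = [t / row_sum for t in target[i]]
        loss -= sum(w * math.log(p) for w, p in zip(weights, probs))
    return loss / len(img_emb)

# Toy embeddings: pairs 0 and 1 are near-duplicates, pair 2 is unrelated.
img_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
txt_emb = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]

# Identity similarity: only the matched pair counts as a positive.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Soft similarity (hand-picked here): pairs 0 and 1 are partial positives
# for each other instead of being pushed apart as pure negatives.
soft = [[1.0, 0.8, 0.0],
        [0.8, 1.0, 0.0],
        [0.0, 0.0, 1.0]]

loss_identity = alignment_loss(img_emb, txt_emb, identity)
loss_soft = alignment_loss(img_emb, txt_emb, soft)
print("identity target:", loss_identity)
print("soft target:", loss_soft)
```

With the identity target, the near-duplicate pairs 0 and 1 are treated as negatives of each other. A soft target lets the loss credit their partial match, which is one way to read "more flexible" above.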
That makes sense, thank you for the response!
Hi, thanks for sharing this amazing work.
I have a question regarding the learnable similarity matrix S. I’m curious about the decision to make it learnable. Considering that we can easily compute the similarity between cross-modal items after generating the synthetic images and text,
Looking forward to your reply.