Open hiker-lw opened 3 days ago
I’m glad to see your interesting work. Although I’m not from your field, I frequently follow research on this topic. There is one key aspect of your method that I don’t fully understand: the necessity of the Semantic Binding Loss. Why not directly use the text embedding of the object phrase as the embedding of the composite token in your paper? Isn’t the token merge operation somewhat redundant? Your ablation experiments also suggest that token merging doesn’t have much effect. This is likely because the text encoder is causal: for example, in the sentence you use in your paper, 'a cat wearing sunglasses and a dog with a hat', the feature of 'dog' will also contain features of 'cat', so the token merge doesn’t work well. So why not directly replace 'dog' with the features of 'a dog with a hat' instead?
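To illustrate the mixing I am referring to, here is a quick check (just a sketch, assuming the standard CLIP ViT-L/14 text encoder from `transformers`; I have not run it against your code): the hidden state at the 'dog' position changes when only the first clause is edited, which can only happen because 'dog' attends to the earlier tokens through the causal mask.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumption: the pipeline's text encoder is the usual CLIP ViT-L/14.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def dog_state(prompt):
    """Last-layer hidden state at the 'dog' token position for a prompt."""
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = text_encoder(ids).last_hidden_state[0]   # (77, 768)
    dog_id = tokenizer.encode("dog")[1]                    # drop BOS/EOS
    pos = (ids[0] == dog_id).nonzero()[0].item()
    return hidden[pos]

a = dog_state("a cat wearing sunglasses and a dog with a hat")
b = dog_state("a bird wearing sunglasses and a dog with a hat")
# Similarity < 1: the 'dog' feature already carries information from the
# first clause, because of the encoder's causal attention.
print(torch.cosine_similarity(a, b, dim=0).item())
```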
- "Why not directly use the text embedding of the object phrase?": Please refer to Appendix D.2 of our paper and Figure 13; we find that directly replacing the prompt with another phrase results in a loss of the subject.
- "Only token merging doesn’t work well": Although the gain from using token merging alone is weak, token merging serves as the foundation for the subsequent optimizations. The ablation experiment on Config. D illustrates this point.
Thanks for your reply!
Regardless of which embedding is used as a direct replacement (the embeddings of the three tokens or the pure phrase embedding), it may lead to the disappearance of the subject. A concurrent work [1] also found this (see Table 3). Using token merging is by no means unnecessary :)
[1] A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization
Additionally, you may have misunderstood our ablation experiments. In Config. D/E, we are not aligning the 'dog' embedding in the original prompt 'a cat wearing sunglasses and a dog with a hat' with the pure phrase 'a dog with a hat', but rather aligning 'a dog with a hat' in the original prompt with its pure phrase embedding.
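To make the target of this alignment concrete, here is a rough sketch of the idea (for illustration only: the phrase span is found by token matching and a plain MSE stands in for the loss; it is not our actual implementation, where the loss drives the optimization of the learnable embeddings): the hidden states of 'a dog with a hat' inside the full prompt are pulled toward the hidden states the same phrase gets when encoded alone as a clean prompt.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def encode(prompt):
    """Return token ids and last-layer hidden states for a prompt."""
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    return_tensors="pt").input_ids
    with torch.no_grad():
        return ids[0], text_encoder(ids).last_hidden_state[0]  # (77,), (77, 768)

full_ids, full_emb = encode("a cat wearing sunglasses and a dog with a hat")
_, clean_emb = encode("a dog with a hat")

# Locate the tokens of "a dog with a hat" inside the full prompt
# (BOS/EOS stripped from the phrase before matching).
phrase = tokenizer("a dog with a hat").input_ids[1:-1]
n = len(phrase)
start = next(i for i in range(len(full_ids) - n + 1)
             if full_ids[i:i + n].tolist() == phrase)

# Alignment in the spirit of the semantic binding loss: pull the phrase's
# in-context states toward its clean-prompt states (plain MSE here; in the
# real method this loss would be minimized w.r.t. the learnable embeddings).
binding_loss = F.mse_loss(full_emb[start:start + n], clean_emb[1:1 + n])
print(binding_loss.item())
```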
In fact, during the initial exploration of this work, we tried various token embedding replacement strategies (not presented in the paper), but none were as effective as the current approach. Thank you for your attention to our work.
Thanks for your quick explanations.
"Disappearance of the subject": in the example "a cat wearing sunglasses and a dog with a hat", are you suggesting that the disappearance of the "dog" will occur? I believe this might be due to the causal attention of the text encoder. Logically, there should be no disappearance of either the object or the subject.
In your paper, you mention, "To ensure that the semantics of the composite tokens correspond accurately to the noun phrases they are meant to represent, we employ a clean prompt as a supervisory signal". Based on the figure illustrating your method, it seems you are aligning the embedding of the composite token "dog" with the embedding of the phrase "a dog with a hat". Is that correct?
I think your work is interesting, and I just want to discuss the points I find confusing.
These two points of confusion were addressed in my previous response.
Direct replacement leads to the disappearance of the subject, which could be either the cat or the dog; this was observed in Figure 13 and in concurrent work [1] (please see Table 3 of that paper). This may be due to the lack of contextual awareness between different subjects caused by direct replacement; we did not delve into the reasons behind this phenomenon.
[1] A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization
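For clarity, the kind of direct replacement discussed above looks roughly like the following (again an illustrative sketch, not code from our repository): the full prompt's encoder output is overwritten at the phrase positions with the clean prompt's encoder output, and the spliced sequence would then condition the diffusion model, e.g. through `prompt_embeds` in diffusers. This is the style of substitution that, in our observation, tends to make one of the two subjects vanish.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def encode(prompt):
    """Return token ids and last-layer hidden states for a prompt."""
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    return_tensors="pt").input_ids
    with torch.no_grad():
        return ids[0], text_encoder(ids).last_hidden_state[0]

full_ids, full_emb = encode("a cat wearing sunglasses and a dog with a hat")
_, clean_emb = encode("a dog with a hat")

# Find where the tokens of "a dog with a hat" sit inside the full prompt.
phrase = tokenizer("a dog with a hat").input_ids[1:-1]
n = len(phrase)
start = next(i for i in range(len(full_ids) - n + 1)
             if full_ids[i:i + n].tolist() == phrase)

# Direct replacement: overwrite those positions with the clean prompt's states.
replaced = full_emb.clone()
replaced[start:start + n] = clean_emb[1:1 + n]

# `replaced.unsqueeze(0)` could then be passed to a Stable Diffusion pipeline
# as `prompt_embeds`; this hard splice is the kind of substitution that, in
# our observation, makes one of the two subjects disappear.
```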
What I mean is that in Config. D/E of the ablation study, since token merging is not used in those configurations, the optimization target for the semantic binding loss is the entire phrase in the original prompt. With token merging applied, the optimization target becomes the composite tokens 'cat' and 'dog'. This ablation study aims to demonstrate that token merging serves as the foundation for the subsequent optimizations. In your previous reply, you wrote:
Now, if you force the dog’s embedding to incorporate all information about "a cat wearing sunglasses" via Semantic Binding Loss, that’s obviously unreasonable and will lead to poor performance.
This may reflect a misunderstanding of our ablation study on your part.
Thank you for your attention to our work.
Thanks for your patient reply.
Regarding 'Now, if you force the dog’s embedding to incorporate all information about "a cat wearing sunglasses" via Semantic Binding Loss, that’s obviously unreasonable and will lead to poor performance': sorry, I made a typo. It should be: 'Now, if you force the cat's embedding to incorporate all information about "a cat wearing sunglasses" via Semantic Binding Loss, that’s obviously unreasonable and will lead to poor performance.'