hutaiHang / ToMe

[NeurIPS 2024] Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
https://arxiv.org/abs/2411.07132

Is the operation of token merging redundant? #2

Open hiker-lw opened 3 days ago

hiker-lw commented 3 days ago

I’m glad to see your interesting work. Although I’m not from your field, I frequently follow research on this topic. There is one key aspect of your method that I don’t fully understand: the necessity of the Semantic Binding Loss. Why not directly use the text embedding of the object phrase as the embedding for the compositional token in your paper? Isn’t the token merge operation somewhat redundant? Your ablation experiments also suggest that token merging doesn’t have much effect. This is likely because the text encoder is causal. For example, in the sentence you used in your paper, ‘a cat wearing sunglasses and a dog with a hat,’ the feature of 'dog' will also contain features of 'cat.' As a result, the token merge doesn’t work well. So why not directly replace the 'dog' with the features of ‘a dog with a hat’ instead?
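For readers following this point, here is a minimal sketch of how one might check the causal-context claim, i.e. that the 'dog' token embedding from the full prompt already carries information from the earlier "cat" clause. The checkpoint name, helper function, and token lookup are illustrative assumptions, not from the paper or this thread.

```python
# Illustrative sketch only: compare the "dog" token embedding in the full prompt with
# the one it gets when "a dog with a hat" is encoded alone. With a causal text encoder
# the two differ, because the full-prompt embedding also attends to the earlier clause.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-large-patch14"   # assumed checkpoint (typical SD 1.x text encoder)
tokenizer = CLIPTokenizer.from_pretrained(name)
encoder = CLIPTextModel.from_pretrained(name)

def token_embedding(prompt, word):
    """Return the last-hidden-state embedding at the first position of `word`."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]            # (seq_len, dim)
    word_id = tokenizer(word, add_special_tokens=False).input_ids[0]
    pos = (enc.input_ids[0] == word_id).nonzero()[0].item()
    return hidden[pos]

full = token_embedding("a cat wearing sunglasses and a dog with a hat", "dog")
alone = token_embedding("a dog with a hat", "dog")
print(torch.cosine_similarity(full, alone, dim=0))  # < 1.0 due to causal context mixing
```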

hutaiHang commented 3 days ago
  1. "Why not directly use the text embedding of the object phrase?" Please refer to Appendix D.2 of our paper and Figure 13: we find that directly replacing the prompt with another phrase results in a loss of the subject.

  2. "Only token merging doesn't work well" Although the gain from using token merging alone is weak, token merging serves as the foundation for the subsequent optimizations; the ablation experiment on Config. D illustrates this point.

hiker-lw commented 3 days ago

Thanks for your reply!

  1. Why combine the embeddings of the three tokens "cat wearing sunglasses" to represent the embedding of "cat"? What I mean is, why not directly use the embedding of "a cat wearing sunglasses" as the embedding for "cat"? Logically, using the embedding of the object phrase directly as the embedding for the object should not cause the subject to be lost.
  2. Using Semantic Binding Loss without performing token merging will naturally result in poor performance. Take this example: “a cat wearing sunglasses and a dog with a hat.” The original embedding of "dog" does not contain information about “wearing” or “sunglasses.” Now, if you force the dog’s embedding to incorporate the full information of "a cat wearing sunglasses" via Semantic Binding Loss, that’s obviously unreasonable, and it will lead to poor performance.

hutaiHang commented 3 days ago
  1. Regardless of which embedding is used as a direct replacement (the embeddings of the three tokens or the pure phrase embedding), it may lead to the disappearance of the subject. A concurrent work [1] also found this (see Table 3). Using token merging is by no means unnecessary :)

     [1] A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization

  2. Additionally, you may have misunderstood our ablation experiments. In Config D/E, we are not aligning the 'dog' embedding in the original prompt 'a cat wearing sunglasses and a dog with a hat' with the pure phrase 'a dog with a hat', but rather aligning 'a dog with a hat' in the original prompt with its pure phrase embedding.

In fact, during the initial exploration of this work, we tried various token embedding replacement strategies (not presented in the paper), but none were as effective as the current approach. Thank you for your attention to our work.
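To make the terminology in this thread concrete, here is a minimal sketch of the two operations under discussion: collapsing a noun phrase's tokens into one composite token, and a semantic-binding-style loss that pulls that composite token toward the embedding the same phrase gets when encoded on its own. The mean-pooling merge, the MSE loss, and the index bookkeeping are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only; mean-pooling and MSE are assumptions, not necessarily how
# ToMe implements token merging or the semantic binding loss.
import torch
import torch.nn.functional as F

def merge_phrase(prompt_emb, span):
    """Collapse the token embeddings of a noun phrase (e.g. 'a dog with a hat') into a
    single composite token inside the full-prompt embedding.
    prompt_emb: (seq_len, dim); span: (start, end) indices of the phrase."""
    start, end = span
    composite = prompt_emb[start:end].mean(dim=0, keepdim=True)   # (1, dim)
    return torch.cat([prompt_emb[:start], composite, prompt_emb[end:]], dim=0)

def semantic_binding_loss(composite_tok, clean_phrase_emb):
    """Pull the composite token toward a pooled embedding of the clean prompt
    ('a dog with a hat' encoded on its own), used as the supervisory signal."""
    return F.mse_loss(composite_tok, clean_phrase_emb.mean(dim=0))

# Usage sketch with stand-in tensors in place of real text-encoder outputs.
prompt_emb = torch.randn(77, 768, requires_grad=True)    # full-prompt embedding
clean_phrase_emb = torch.randn(6, 768)                    # 'a dog with a hat' encoded alone
merged = merge_phrase(prompt_emb, (6, 12))                # composite token lands at index 6
loss = semantic_binding_loss(merged[6], clean_phrase_emb)
loss.backward()                                           # gradients flow back to prompt_emb
```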

hiker-lw commented 2 days ago

Thanks for your quick explanations.

  1. "Disappearance of the subject": In the example "a cat wearing sunglasses and a dog with a hat", are you suggesting that the disappearance of the "dog" will occur? I believe this might be due to the casual attention of the text encoder. Logically, there should be no disappearance of either the object or the subject.

  2. In your paper, you mention, "To ensure that the semantics of the composite tokens correspond accurately to the noun phrases they are meant to represent, we employ a clean prompt as a supervisory signal". Based on the figure illustrating your method, it seems you are aligning the composite token "dog" embedding with the phrase "a dog with a hat". Is that correct?

hiker-lw commented 2 days ago

I think your work is interesting, and I just want to have a discussion about my confusion.

hutaiHang commented 2 days ago

Both of these points were addressed in my previous response.

  1. Direct replacement leads to the disappearance of the subject, which could be either the cat or the dog; this was observed in Figure 13 and in concurrent work [1] (see Table 3 of that paper). This may be due to the lack of contextual awareness between different subjects caused by direct replacement; we did not delve into the reasons behind this phenomenon.

    [1] A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization

  2. What I mean is that in Config D/E of the ablation study, since token merging is not used in this configuration, the optimization target of the semantic binding loss is the entire phrase corresponding to the original prompt. With token merging applied, the optimization target becomes the composite tokens 'cat' and 'dog'. This ablation study aims to demonstrate that token merging serves as the foundation for the subsequent optimizations. In your previous reply, you stated:

    Now, if you force the dog’s embedding to incorporate all information about "a cat wearing sunglasses" via Semantic Binding Loss, that’s obviously unreasonable and will lead to poor performance.

    This may reflect a misunderstanding of our ablation study on your part.

Thank you for your attention to our work.
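Restating the ablation distinction above as a hypothetical sketch: the two configurations differ only in what is aligned with the clean-phrase embedding, the whole un-merged phrase span versus the single composite token. The pooling and loss choices below are assumptions for illustration, not the paper's code.

```python
# Illustrative only: the two alignment targets discussed for Config D/E vs. the full method.
import torch.nn.functional as F

def loss_without_merge(prompt_emb, span, clean_phrase_emb):
    """Config D/E (no token merging): align the whole phrase span of the original-prompt
    embedding ('a dog with a hat' inside the full prompt) with the clean phrase embedding.
    Assumes the span and the clean phrase cover the same number of tokens."""
    start, end = span
    return F.mse_loss(prompt_emb[start:end], clean_phrase_emb)

def loss_with_merge(merged_prompt_emb, composite_idx, clean_phrase_emb):
    """Full method: after token merging, align only the single composite token."""
    return F.mse_loss(merged_prompt_emb[composite_idx], clean_phrase_emb.mean(dim=0))
```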

hiker-lw commented 2 days ago

Thanks for your patient reply.

  1. I no longer have any questions about the first issue. Thanks for your explanations.
  2. As for the second issue, my understanding is that, in the context of Config D/E, you are aligning each token of 'a cat wearing sunglasses' from the original prompt with the corresponding token of the independently encoded phrase 'a cat wearing sunglasses'. Is that right? Perhaps I've misunderstood, so could you please explain your approach in Config D/E more clearly?

Regarding my earlier sentence, 'Now, if you force the dog's embedding to incorporate all information about "a cat wearing sunglasses" via Semantic Binding Loss, that's obviously unreasonable and will lead to poor performance': sorry, I made a typo. It should read: 'Now, if you force the cat's embedding to incorporate all information about "a cat wearing sunglasses" via Semantic Binding Loss, that's obviously unreasonable and will lead to poor performance.'