Closed — zifuwan closed this issue 1 week ago
Hi @zifuwan,
In the “Zeros / Ones / Noise” settings, non-blind image tokens are replaced with zeros, ones, or Gaussian noise. Contrastive decoding is then applied to heighten the impact of these non-blind tokens while diminishing the influence of blind tokens.
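For readers unfamiliar with contrastive decoding, a common formulation (this exact equation and the `alpha` parameter are assumptions for illustration, not necessarily the paper's implementation) amplifies the difference between the logits from the full input and the logits from the distorted (zeroed/one-filled/noised) input:

```python
import numpy as np

def contrastive_decode(logits_full, logits_distorted, alpha=1.0):
    # Common contrastive-decoding rule: boost what the full input
    # predicts relative to the distorted input. alpha controls how
    # strongly the distorted branch is subtracted.
    return (1 + alpha) * np.asarray(logits_full) - alpha * np.asarray(logits_distorted)

# Toy example with a 2-token vocabulary (values are hypothetical):
# the full input slightly prefers token 0, but token 0 is also strongly
# supported by the distorted input, so contrastive decoding flips the choice.
logits_full = np.array([2.0, 1.9])
logits_distorted = np.array([1.5, 0.5])
cd = contrastive_decode(logits_full, logits_distorted, alpha=1.0)
```

Here `int(np.argmax(logits_full))` is 0 while `int(np.argmax(cd))` is 1: the token whose evidence comes mostly from the non-distorted content wins.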
On the other hand, “Mask” involves masking these tokens within the attention mechanism itself, rather than directly assigning specific values to them. By doing so, we prevent the attention layers from focusing on these tokens, thereby deactivating them in an indirect way.
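To make the distinction concrete, here is a minimal sketch of single-head attention (NumPy, no batching; the `blind` flags and all shapes are hypothetical). Note that zeroing a token's embedding does not remove it from attention: its dot-product scores become 0, and `softmax` still assigns it nonzero weight, whereas masking sets its scores to `-inf` and gives it exactly zero weight:

```python
import numpy as np

def attention(q, k, v, attn_mask=None):
    # Scaled dot-product attention. attn_mask[i, j] == False means
    # query i may NOT attend to key j (its score is set to -inf).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if attn_mask is not None:
        scores = np.where(attn_mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))              # 4 tokens, dim 8
blind = np.array([False, True, False, True])  # hypothetical blind-token flags

# "Zeros": replace blind-token values with zeros; attention still
# spends probability mass on the zeroed positions.
zeroed = tokens.copy()
zeroed[blind] = 0.0
out_zeros, w_zeros = attention(zeroed, zeroed, zeroed)

# "Mask": leave token values untouched; forbid attending to blind
# keys inside the attention mechanism itself.
keep = ~blind
mask = np.broadcast_to(keep, (4, 4))  # every query keeps only non-blind keys
out_mask, w_mask = attention(tokens, tokens, tokens, attn_mask=mask)
```

Under "Mask", `w_mask[:, blind]` is exactly zero; under "Zeros", `w_zeros[:, blind]` stays positive, which is why the two schemes are not equivalent even though both aim to deactivate blind tokens.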
We recognize that the explanation of “Mask” is insufficient in this version of the paper, and we’ll be revising it soon.
Thank you for your attention to this!
Thanks for the reply. Is the "Mask" implementation somewhere else? From the figure "Mask" seems the same as "Zeros".
Hi @zifuwan,
It seems that we missed updating the “Mask” part of the code while organizing it for the experiment. We’ll make sure to update it soon. Conceptually, as mentioned before, the “Mask” approach modifies the attention mechanism itself.
Thanks a lot for reviewing the code and providing valuable feedback!
Hi, what is the difference here between zeros and mask? From the code it seems like the two schemes both zero out blind tokens.