sangminwoo / AvisC

Official pytorch implementation of "Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models"
https://sangminwoo.github.io/AvisC
MIT License

Difference between zeros and masking #5

Closed zifuwan closed 1 week ago

zifuwan commented 2 weeks ago

Hi, what is the difference here between zeros and mask? From the code it seems like the two schemes both zero out blind tokens.

[screenshot of the code in question]
sangminwoo commented 2 weeks ago

Hi @zifuwan,

In the “Zeros / Ones / Noise” variants, non-blind image tokens are replaced with zeros, ones, or Gaussian noise. Contrastive decoding is then applied to heighten the impact of these non-blind tokens while diminishing the influence of blind tokens.
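For illustration, here is a minimal sketch of what that description amounts to; the names (`image_embeds`, `non_blind_idx`) and the contrastive weight `alpha` are placeholders, not the repository's actual variables or API.

```python
import torch

def perturb_non_blind(image_embeds, non_blind_idx, mode="zeros"):
    """Replace non-blind image tokens so the perturbed branch is driven by blind tokens.

    image_embeds: (num_image_tokens, dim); non_blind_idx: indices of non-blind tokens.
    """
    perturbed = image_embeds.clone()
    if mode == "zeros":
        perturbed[non_blind_idx] = 0.0
    elif mode == "ones":
        perturbed[non_blind_idx] = 1.0
    elif mode == "noise":
        perturbed[non_blind_idx] = torch.randn_like(perturbed[non_blind_idx])
    return perturbed

def contrastive_logits(logits_original, logits_perturbed, alpha=1.0):
    # The perturbed branch is conditioned mostly on blind tokens, so subtracting it
    # diminishes their influence and heightens the impact of non-blind tokens.
    return (1 + alpha) * logits_original - alpha * logits_perturbed
```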

On the other hand, “Mask” masks these tokens within the attention mechanism itself, rather than directly assigning specific values to them. By doing so, we prevent the attention layers from focusing on these tokens, thereby deactivating them indirectly.
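A minimal sketch of that contrast, assuming `masked_token_idx` holds the key positions being deactivated (the actual hook into the model's attention layers in the repository may look different):

```python
import torch
import torch.nn.functional as F

def attention_with_token_mask(q, k, v, masked_token_idx):
    """Scaled dot-product attention where selected key positions are masked out.

    q, k, v: (batch, heads, seq_len, head_dim); masked_token_idx: key positions to deactivate.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # Instead of overwriting the token embeddings, set their attention scores to -inf
    # so that after softmax no query attends to them.
    scores[..., masked_token_idx] = float("-inf")
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```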

We recognize that the explanation of “Mask” is not sufficient in this version of the paper, and we’ll revise it soon.

Thank you for your attention to this!

zifuwan commented 2 weeks ago

Thanks for the reply. Is the "Mask" implementation somewhere else? From the figure, "Mask" seems the same as "Zeros".

[screenshot of the code in question]
sangminwoo commented 1 week ago

Hi @zifuwan,

It seems that we missed updating the “mask” part of the code while organizing it for the experiments. We’ll make sure to update it soon. Conceptually, as mentioned before, the "mask" approach modifies the attention mechanism itself.

Thanks a lot for reviewing the code and providing valuable feedback!