shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
MIT License

Attention map plotting #10

Closed: franciscoliu closed this issue 5 months ago

franciscoliu commented 5 months ago

Dear authors,

Thank you for this wonderful paper! I reproduced your Figure 2 (the attention map from InstructBLIP) and got the result below. I do not see the prominent pattern highlighted in the paper (the red box). Compared with the word "Additionally", the word "that" seems to have a much larger impact. Do you have any idea what might have gone wrong? Thank you for your help.

rejected_attention_map
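
For reference, this is roughly the plotting logic I used (a minimal sketch with hypothetical names, not this repo's script). I take the per-layer decoder self-attentions from a HuggingFace-style forward pass with `output_attentions=True`, average over heads, and plot the causal part:

```python
import torch
import matplotlib.pyplot as plt

def plot_attention_map(attentions, tokens, layer=-1):
    # attentions: tuple of (batch, heads, seq, seq) tensors, one per layer;
    # tokens: decoded token strings for axis labels.
    attn = attentions[layer][0].mean(dim=0)   # average over heads -> (seq, seq)
    attn = torch.tril(attn)                   # keep the causal (lower-triangular) part
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(attn.float().cpu().numpy(), cmap="viridis")
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90, fontsize=6)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens, fontsize=6)
    plt.tight_layout()
    plt.show()
```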

shikiw commented 5 months ago

Hi,

Thanks for your appreciation! Sorry that the unclear writing in our paper led to this misunderstanding!

This is a good question! In fact, what we want to claim is that "the appearance of aggregation patterns in the context makes it easier to induce hallucinated content in the subsequent tokens", not that "all tokens following an aggregation pattern must be hallucinations"; these are different claims. As we state in the paper, aggregation patterns are an inherent property of LLMs, and this property becomes a cause of hallucination in current MLLMs, because the vision tokens are gradually attenuated in the information flow, especially as the context grows longer (see the sketch below for one way to observe this).

So the key point is the existence of the aggregation pattern, not which specific token it is located on. The differences between your visualization and our Figure 2 can have many causes: a different machine, a different environment, or a different sequence (I notice that you copied InstructBLIP's answer directly from Figure 2, but that answer is not complete, since we omitted part of the sentence with an ellipsis). In general, we do not care which token the pattern appears on; it might appear at "_him" or "_from" next time, which would not be surprising (by the way, in our observation the pattern is more likely to appear at ".", "'", and "\n"). What we care about is whether hallucinations become more likely as more and more such patterns appear in the context.
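If you want to check for the pattern quantitatively rather than by eye, a minimal sketch (a rough illustration, not OPERA's actual over-trust penalty) is to score each context token by the mean attention it receives from all later tokens; columnar "aggregation" tokens such as ".", "'" or "\n" should stand out with unusually high scores:

```python
import torch

def aggregation_scores(attn: torch.Tensor) -> torch.Tensor:
    # attn: (seq, seq) causal attention map averaged over heads.
    seq = attn.size(0)
    later = torch.tril(torch.ones(seq, seq, device=attn.device), diagonal=-1)  # 1 where row > col
    received = (attn * later).sum(dim=0)    # total attention each column receives from later rows
    counts = later.sum(dim=0).clamp(min=1)  # number of later tokens per column
    return received / counts                # mean attention received per later token
```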

I hope this helps you well :)