shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Could you provide the script to plot the attention map? #14

Closed · Rainlt closed this 4 months ago

Rainlt commented 5 months ago

The visualization of the attention map of each token in the paper looks very nice. How did you draw it? Could you provide the script you used?

shikiw commented 4 months ago

Hi, thanks for your appreciation!

We have uploaded the visualization file here.

This visualization is achieved simply by setting output_attentions=True in the transformers generate function and then visualizing the out.attentions field of the output. Since there are 32 heads, we take the max over the head dimension and use the attention map of the last layer.
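
For readers looking for a concrete starting point, here is a minimal sketch of that post-processing (this is not the authors' vis.ipynb; it only assumes out.attentions has the structure transformers returns when output_attentions=True and return_dict_in_generate=True are set, and the helper name build_attention_matrix is made up for the example):

import torch
import torch.nn.functional as F

def build_attention_matrix(attentions):
    """Stack each generated token's last-layer attention row into one matrix.

    attentions: out.attentions as returned by transformers generate, i.e. a
    tuple (one entry per generated token) of tuples (one per decoder layer)
    of tensors shaped (batch, num_heads, query_len, key_len).
    Returns a (num_generated_tokens, final_key_len) tensor.
    """
    total_len = attentions[-1][-1].shape[-1]              # key length at the final step
    rows = []
    for step_attn in attentions:
        last_layer = step_attn[-1]                        # last layer: (batch, 32, q, k)
        row = last_layer.max(dim=1).values[0, -1]         # max over heads, last query row
        row = F.pad(row, (0, total_len - row.shape[-1]))  # right-pad to a common width
        rows.append(row)
    return torch.stack(rows)

Row i is then the attention of the i-th generated token over all preceding positions; plotting the matrix as a heatmap gives the kind of per-token attention figure discussed here.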

Note: the model.generate_output function is essentially equivalent to the model.generate function in our repo (here, generate is a wrapper around the transformers generate function, so be aware of the difference between the two); it directly outputs output_ids. For example, in the generate function in OPERA/minigpt4/models/llava.py, if you comment out lines 237-243 and directly return output_ids instead, that is the generate_output function needed in vis.ipynb.

Rainlt commented 4 months ago

Thank you very much!

minhoooo1 commented 4 months ago

Regarding "if you comment out lines 237-243 and then directly return output_ids": that's right, but output_ids is just a Tensor object which has no 'attentions' attribute, even with output_attentions=True.
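
For context, this is standard transformers behaviour rather than anything OPERA-specific: output_attentions=True only makes the model compute the attentions, while return_dict_in_generate=True decides whether generate returns a ModelOutput that exposes them; without it you get a plain tensor of ids. A toy illustration with a small public text-only checkpoint (sshleifer/tiny-gpt2, used here purely for the demo and unrelated to OPERA):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
inputs = tok("A photo of", return_tensors="pt")

# Default return: a plain LongTensor of token ids, so .attentions does not exist.
ids_only = model.generate(**inputs, max_new_tokens=5, output_attentions=True)
print(type(ids_only))                           # <class 'torch.Tensor'>

# With return_dict_in_generate=True, generate returns a ModelOutput that
# carries .sequences plus the requested .attentions.
out = model.generate(**inputs, max_new_tokens=5,
                     output_attentions=True,
                     return_dict_in_generate=True)
print(type(out).__name__, len(out.attentions))  # one attention tuple per new token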

Rainlt commented 4 months ago

Yes, the code in vis.ipynb may be incomplete. I have modified it and it can now output the attentions.

Maybe you should pass the complete set of parameters to the second generate call in vis.ipynb, including return_dict_in_generate=True, which controls whether the attentions are returned in the output, as shown in transformers/generation/utils.py.

Part of the code is as follows:

# Append the first pass's output to the prompt, then run the second
# generate call with attention outputs enabled and returned as a dict.
qu_append = out[0]
qu = qu + qu_append

with torch.inference_mode():
    with torch.no_grad():
        out = model.generate_output(
            {"image": norm(image), "prompt": qu},
            use_nucleus_sampling=False,   # args.sample
            num_beams=1,                  # args.beam
            max_new_tokens=512,
            output_attentions=True,
            return_dict_in_generate=True,
        )

Additionally, you should add return_dict_in_generate to the definition of the generate_output function in llava.py and pass it through in the corresponding call, e.g. output_ids = self.llama_model.generate(..., return_dict_in_generate=True, ...). Finally, you can get the attentions from output_ids.
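
Assuming both changes are in place, the object returned by the patched generate_output is just the usual transformers generate output, so the pieces can be pulled apart roughly like this (an illustration reusing the build_attention_matrix helper sketched earlier in this thread, not code from the repo):

# `out` is the value returned by the patched generate_output call in the
# snippet above, i.e. a transformers generate output obtained with
# return_dict_in_generate=True.
output_ids = out.sequences       # (batch, prompt_len + new_tokens) token ids
attentions = out.attentions      # tuple over generated tokens -> tuple over layers

# The decoded text still comes from the ids, e.g.
#   text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# The attentions can then feed the helper sketched earlier:
attn_matrix = build_attention_matrix(attentions)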

minhoooo1 commented 4 months ago

@Rainlt Thank you for your reply. In addition to the modifications you mentioned, we also need to change out = model.generate_output(...) in vis.ipynb to out = model.generate(...). By the way, has the visual attention been scaled up here, as mentioned in your paper: "Scaling up the attention values as the values are usually too small"?

Rainlt commented 4 months ago

Aha, I'm afraid I am not the author. But indeed, the attention values are multiplied by 5 in this script.
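
For completeness, the scaling is only for display: each attention row sums to 1 over hundreds of positions, so the raw values are tiny. A hedged plotting sketch (the factor of 5 matches what is described here; the colormap, clamping, and axis labels are just illustrative):

import matplotlib.pyplot as plt

# attn_matrix: the (generated_tokens, key_len) map built earlier in the thread.
scaled = (attn_matrix * 5).clamp(max=1.0)   # scale by 5 as noted above, then cap at 1
plt.imshow(scaled.float().cpu().numpy(), cmap="viridis")
plt.xlabel("attended-to (key) position")
plt.ylabel("generated token index")
plt.colorbar()
plt.show()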

minhoooo1 commented 4 months ago

@Rainlt Thank you sincerely for your reply; I did notice the multiplication by 5.

Rainlt commented 4 months ago

That's OK.

Ivesfu commented 4 months ago

@Rainlt Hello, I tried to visualize it but ran into some problems. Could you please share your code after the fix? Thanks a lot!