salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

BLIP GradCAM #336

Open gwyong opened 1 year ago

gwyong commented 1 year ago

I have a question about GradCAM applied in BLIP.

In your code, there is a getAttMap function (lavis.common.gradcam.getAttMap). When we use it, we take the gradient of the cross-attention values. What is the difference between using the attention alone and using the gradient of the attention?

My understanding is that the attention scores by themselves already show the relevance between image features and text features. Why do we need the gradient as well?

When we use GradCAM with CNN models, we use the gradient because backpropagating through all the layers reveals the important regions. For the function here, however, I wonder whether it can also backpropagate into the vision transformer beyond the cross-attention map.

I would appreciate your comments.
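For context on the attention-versus-gradient question, here is a minimal numpy sketch of the gradient-weighted attention idea (this is an illustration of the general technique, not LAVIS's actual implementation; the shapes and numbers are made up). Raw attention shows where the model looked, while weighting by the positively clamped gradient keeps only the regions whose attention actually increased the matching score:

```python
import numpy as np

def gradcam_from_attention(attn, grad):
    """GradCAM-style map from cross-attention and its gradient.

    attn: (num_heads, num_patches) cross-attention weights for one text token
    grad: gradient of the matching score w.r.t. those same attention weights

    Negative gradients (attention that *hurt* the score) are clipped to
    zero, so only score-increasing attention survives in the map.
    """
    weighted = attn * np.clip(grad, 0, None)  # drop negative contributions
    return weighted.mean(axis=0)              # average over heads

# toy example: 2 heads attending over 4 image patches
attn = np.array([[0.1, 0.4, 0.3, 0.2],
                 [0.2, 0.2, 0.5, 0.1]])
grad = np.array([[1.0, -2.0, 0.5, 0.0],   # negative grad on patch 1, head 0
                 [0.0,  1.0, 2.0, 0.0]])

cam = gradcam_from_attention(attn, grad)
# patch 1 receives strong attention from head 0, but head 0's gradient
# there is negative, so only head 1 contributes to its map value
```

This is why gradient weighting matters even without backpropagating further into the vision transformer: attention alone would rank patch 1 highly, while the gradient-weighted map suppresses it.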

gwyong commented 1 year ago

Also, I found that in the image-text matching model the attention map (CAM) is static and only the gradient varies, which makes the GradCAM change for each token. Do you know why?
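The arithmetic behind that observation can be sketched as follows (hypothetical numbers, not taken from the model): if the stored attention map is shared, a token-specific gradient is enough to produce a token-specific GradCAM.

```python
import numpy as np

# One shared ("static") attention map over 4 image patches.
attn = np.array([0.1, 0.4, 0.3, 0.2])

# A different matching-score gradient per text token (made-up values).
grads = {
    "dog":  np.array([0.0, 2.0, 0.1, 0.0]),
    "park": np.array([0.1, 0.0, 0.0, 3.0]),
}

# Even though attn is identical for both tokens, the per-token GradCAMs
# differ, because the gradient singles out the patches relevant to each token.
cams = {tok: attn * np.clip(g, 0, None) for tok, g in grads.items()}
```

So a static CAM plus a variable gradient is sufficient to explain per-token maps; the gradient carries all the token-specific signal.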