I have a question about GradCAM as applied in BLIP.

In your code there is a getAttMap function (lavis.common.gradcam.getAttMap). When we use this, we take the gradient of the cross-attention values. What is the difference between taking just the attention versus the gradient of the attention? My understanding is that the attention scores alone can already show the relevance between image features and text features, so why do we need the gradient value as well?
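To make the question concrete, here is a minimal sketch of the two variants as I understand them. The tensor names and shapes (cross_attn, itm_logit) are toy stand-ins of my own, not the actual LAVIS internals:

```python
import torch

# Toy stand-ins (hypothetical, not LAVIS code): a cross-attention map that
# requires grad, and a scalar image-text matching score computed from it.
num_text_tokens, num_image_patches = 8, 576
cross_attn = torch.rand(num_text_tokens, num_image_patches, requires_grad=True)
itm_logit = (cross_attn * torch.rand_like(cross_attn)).sum()  # dummy ITM score

# Variant 1: attention only -- relevance read straight off the attention scores.
relevance_attn = cross_attn.detach()

# Variant 2: gradient-weighted (GradCAM-style) -- weight each attention value by
# the gradient of the matching score w.r.t. that value, so only the attention
# that actually pushes the score up (positive gradient) counts as relevant.
grads = torch.autograd.grad(itm_logit, cross_attn)[0]
relevance_gradcam = cross_attn.detach() * grads.clamp(min=0)
```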
When we use GradCAM with CNN models, we use the gradient because we can find the important regions by backpropagating through all the layers. For the function here, however, I wonder whether it can also backpropagate through the vision transformer beyond the cross-attention map.
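For reference, this is the standard CNN GradCAM recipe I have in mind (a generic sketch, not your code); there, the gradient is read out at the last convolutional block but is produced by backpropagating through every layer above it:

```python
import torch
import torchvision

# Standard GradCAM on a CNN, for contrast with the BLIP case.
model = torchvision.models.resnet18(weights=None).eval()
feats, grads = {}, {}

def fwd_hook(_module, _inputs, output):
    feats["a"] = output

def bwd_hook(_module, _grad_in, grad_out):
    grads["a"] = grad_out[0]

# Hook the last conv block; the backward pass reaches it through all later layers.
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.rand(1, 3, 224, 224)
model(x)[0, 0].backward()  # backprop the score of one class through the network

# Channel weights = global-average-pooled gradients; CAM = weighted feature sum.
weights = grads["a"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * feats["a"]).sum(dim=1))
```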
Also, I found that in the image-text matching model the CAM is static and only the gradient varies, which makes the GradCAM change for each token. Do you know why?
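To illustrate the observation with toy tensors (again my own stand-ins, not the LAVIS tensors): the same CAM combined with a different gradient per token yields a different GradCAM per token.

```python
import torch

num_tokens, num_patches = 8, 576
cam = torch.rand(num_patches)                # static: shared by all text tokens
grads = torch.rand(num_tokens, num_patches)  # variable: one gradient per token

# Same CAM, different gradients -> a distinct GradCAM row per text token.
gradcam_per_token = cam.unsqueeze(0) * grads.clamp(min=0)
```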
I would appreciate your comments here.