Starting from the tutorial link and considering the function compute_gradcam in BlipITM link, I'm trying to obtain the same result but using Blip2ITM. The function getAttMap is at link.
This is my code:
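(The hook bookkeeping below, i.e. setting save_attention and reading get_attn_gradients() / get_attention_map() on layer[10].crossattention.self, is what I adapted from BlipITM's compute_gradcam; I'm assuming the Q-Former's attention module exposes the same helpers. The checkpoint name, image and caption are just placeholders.)

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Blip2ITM with the matching head (model type is a placeholder)
model, vis_processors, text_processors = load_model_and_preprocess(
    "blip2_image_text_matching", "pretrain", device=device, is_eval=True
)

raw_image = Image.open("example.jpg").convert("RGB")
caption = "a dog running on the beach"

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = text_processors["eval"](caption)

# target cross-attention layer inside the Q-Former
layer = model.Qformer.bert.encoder.layer[10]
layer.crossattention.self.save_attention = True

# forward with the ITM head and backprop the "match" logit,
# mirroring compute_gradcam in BlipITM
itm_output = model({"image": image, "text_input": text}, match_head="itm")
loss = itm_output[:, 1].sum()
model.zero_grad()
loss.backward()

tokenized_text = model.tokenizer(
    text, truncation=True, max_length=32, return_tensors="pt"
).to(device)

with torch.no_grad():
    # same reshaping as in BlipITM: [1, 1, N, 1, 1], N = number of text tokens
    mask = tokenized_text.attention_mask.view(
        tokenized_text.attention_mask.size(0), 1, -1, 1, 1
    )

    # in BlipITM these are [1, 12, N, 577]; here both come out as [1, 12, 32, 257]
    grads = layer.crossattention.self.get_attn_gradients()
    cams = layer.crossattention.self.get_attention_map()
```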
I used model.Qformer.bert.encoder.layer[10] as the target layer. What differs from BlipITM is that there cams and grads have a dynamic shape [1, 12, N, 577], where N is the number of tokens in the input text.

In Blip2ITM, instead, the Q-Former appears to be instantiated with num_query_token=32, so grads and cams always come out as [1, 12, 32, 257].
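Spelling the two shapes out (the patch-grid breakdown is my reading of the two vision backbones, so treat it as an assumption):

```python
# BlipITM, cross-attention in the text encoder:
#   grads/cams: [1, 12, N, 577]
#   12 heads, N text tokens, 577 = 1 CLS + 24*24 patches of the 384x384 ViT
# Blip2ITM, cross-attention in Q-Former layer[10]:
#   grads/cams: [1, 12, 32, 257]
#   12 heads, 32 query tokens (num_query_token), 257 = 1 CLS + 16*16 patches
#   of the 224x224 ViT-g
print(grads.shape, cams.shape)  # torch.Size([1, 12, 32, 257]) for both
```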
For example, using that input text, the shapes of grads/cams ([1, 12, 32, 257]) and of the mask ([1, 1, N, 1, 1]) no longer line up as they do in BlipITM. So, to be able to multiply grads, cams, and the mask, I tried to keep only the first N positions (N = mask.shape[2]):
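Concretely, something like this (the variable names and the 16x16 reshape are just how I adapted the BlipITM version, so take them as assumptions):

```python
with torch.no_grad():
    N = mask.shape[2]  # number of text tokens

    # keep only the first N of the 32 query positions so the mask broadcasts,
    # then mirror BlipITM: drop the CLS patch and reshape 256 -> 16x16
    cams_n = cams[:, :, :N, 1:].reshape(image.size(0), 12, -1, 16, 16) * mask
    grads_n = grads[:, :, :N, 1:].clamp(0).reshape(image.size(0), 12, -1, 16, 16) * mask

    # average over the 12 heads -> [1, N, 16, 16], one map per text token
    gradcam = (cams_n * grads_n).mean(1).cpu()
```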
Doing this I got no error but the Grad-CAM is awful and doesn't make sense at all. What's wrong with this?