Hi guys, I am trying to visualize the attention map of the pre-trained model Blip2-opt-6.7b.
I set the attention-output flags to True and successfully got cross_attentions from the output object BaseModelOutputWithPoolingAndCrossAttentions. The cross_attentions tuple has length 6 (the 12-layer Q-Former has cross-attention in every other layer), and each tensor has shape batch size (1) x 12 x 257 x 32, all consistent with the config files (12 heads, 16 x 16 patches + 1 CLS token = 257, and 32 query tokens). So I transposed the last two dimensions, averaged over query tokens and heads, dropped the CLS position (leaving 256 values), reshaped to 16 x 16, and overlaid the map on the original image. However, the visualizations looked like nonsense.
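For reference, here is a minimal, self-contained sketch of the pipeline I described (calling Blip2Model's vision_model and qformer submodules directly, the input file name, picking the last cross-attention layer, and the bicubic upsampling are illustrative choices, not necessarily my exact code):

```python
import numpy as np
import torch
from PIL import Image
import matplotlib.pyplot as plt
from transformers import Blip2Processor, Blip2Model

# Note: the 6.7b checkpoint is large; a smaller one (e.g. blip2-opt-2.7b)
# should behave the same for this experiment.
model_name = "Salesforce/blip2-opt-6.7b"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2Model.from_pretrained(model_name)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # ViT features: (1, 257, hidden) = CLS + 16 x 16 patches for a 224px input
    image_embeds = model.vision_model(pixel_values).last_hidden_state
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    # Run the Q-Former with output_attentions=True to get the
    # query-token-to-image-patch cross-attentions.
    qformer_out = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=torch.ones(image_embeds.shape[:-1], dtype=torch.long),
        output_attentions=True,
        return_dict=True,
    )

# Tuple of 6 tensors: cross-attention sits in every other Q-Former layer.
cross_attentions = qformer_out.cross_attentions

attn = cross_attentions[-1]          # expect (batch, heads, queries, image tokens);
                                     # transpose(-1, -2) first if yours is (1, 12, 257, 32)
attn = attn.mean(dim=1).mean(dim=1)  # average over heads, then query tokens -> (1, 257)
heat = attn[0, 1:].reshape(16, 16)   # drop the CLS position, reshape 256 -> 16 x 16
heat = heat.float().numpy()

# Upsample the 16 x 16 map to the image size and overlay it.
heat_img = Image.fromarray(heat).resize(image.size, Image.BICUBIC)
plt.imshow(image)
plt.imshow(np.asarray(heat_img), alpha=0.5, cmap="jet")
plt.axis("off")
plt.show()
```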
Could anyone give me some advice?