microsoft / Oscar

Oscar and VinVL

Why all rows of attention map are the same (precision=3)? #175

Open panmianzhi opened 2 years ago

panmianzhi commented 2 years ago

I use RoI image features extracted in the same way as ROSITA, and I directly concatenate the 2048-d image features with the 6-D box features (x_min, y_min, x_max, y_max, height, width) to form the input image features. With output_attention=True and the pretrained Oscar model (base-vg-labels), I found that every row of the attention map from the last layer (attention scores summed over all heads) is identical, as in the following image. This result seems strange. Can someone explain it?
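For reference, a minimal sketch of what I am doing (assuming PyTorch and a BERT-style model that returns per-layer attentions when output_attention(s)=True); the 2048-d/6-D feature layout follows the description above, while the model call and variable names are only illustrative:

```python
import torch

# Hypothetical inputs: 2048-d RoI features and raw box coordinates for N regions.
num_regions = 36
roi_feats = torch.randn(num_regions, 2048)   # region appearance features
boxes = torch.rand(num_regions, 4)           # (x_min, y_min, x_max, y_max), normalized

# 6-D box features: (x_min, y_min, x_max, y_max, height, width).
height = boxes[:, 3] - boxes[:, 1]
width = boxes[:, 2] - boxes[:, 0]
box_feats = torch.cat([boxes, height.unsqueeze(1), width.unsqueeze(1)], dim=1)

# Input image features: direct concatenation -> (N, 2054).
img_feats = torch.cat([roi_feats, box_feats], dim=1)

# Assuming the model returns a tuple of per-layer attention tensors of shape
# (batch, num_heads, seq_len, seq_len), the map in question would be built as:
# outputs = model(input_ids, img_feats=img_feats.unsqueeze(0), ...)
# attentions = outputs[-1]
# last_layer = attentions[-1]              # (1, num_heads, seq_len, seq_len)
# attn_map = last_layer.sum(dim=1)[0]      # sum over heads -> (seq_len, seq_len)
# Identical rows would mean every query position attends with the same distribution.
```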