microsoft / Oscar

Oscar and VinVL

Why all rows of attention map are the same (precision=3)? #175

Open panmianzhi opened 2 years ago

panmianzhi commented 2 years ago

I use RoI image features extracted in the same way as ROSITA, and I directly concatenate the 2048-d image features with the 6-D box features (x_min, y_min, x_max, y_max, height, width) to form the input image features. With output_attention=True and the pretrained Oscar model (base-vg-labels), I found that every row of the attention map from the last layer (attention scores summed over all heads) is identical, as in the following image. This result seems strange. Can someone explain it?
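For reference, a minimal sketch of what I am doing (assuming PyTorch and a BERT-style model that returns per-layer attentions when output_attention(s)=True); the 2048-d/6-D feature layout follows the description above, while the model call and variable names are only illustrative:

```python
import torch

# Hypothetical inputs: 2048-d RoI features and raw box coordinates for N regions.
num_regions = 36
roi_feats = torch.randn(num_regions, 2048)   # region appearance features
boxes = torch.rand(num_regions, 4)           # (x_min, y_min, x_max, y_max), normalized

# 6-D box features: (x_min, y_min, x_max, y_max, height, width).
height = boxes[:, 3] - boxes[:, 1]
width = boxes[:, 2] - boxes[:, 0]
box_feats = torch.cat([boxes, height.unsqueeze(1), width.unsqueeze(1)], dim=1)

# Input image features: direct concatenation -> (N, 2054).
img_feats = torch.cat([roi_feats, box_feats], dim=1)

# Assuming the model returns a tuple of per-layer attention tensors of shape
# (batch, num_heads, seq_len, seq_len), the map in question would be built as:
# outputs = model(input_ids, img_feats=img_feats.unsqueeze(0), ...)
# attentions = outputs[-1]
# last_layer = attentions[-1]              # (1, num_heads, seq_len, seq_len)
# attn_map = last_layer.sum(dim=1)[0]      # sum over heads -> (seq_len, seq_len)
# Identical rows would mean every query position attends with the same distribution.
```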