lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0

Spatial attention visualization #472

Open sepatin opened 1 week ago

sepatin commented 1 week ago

Hello

First of all, thank you for your great work. I would like to extract the cross-attention maps to visualize spatial attention overlaid on my images, both during training (on my validation set) and at inference (on my batches).

I have set up hooks to capture them, but I have trouble selecting the right tensor (among the arguments passed to the hook) and decomposing it correctly. In particular, I can't determine value_spatial_shapes, which seems to vary even though my images are of fixed size. Could you give me any pointers on this, or on the right way to do it?

Regards

sepatin commented 1 week ago

Update

I found a way to collect the attention layers and attention levels for each inference on my images (cross_attn, to build an image that helps "see" my detections).

**Inference**

```python
with torch.no_grad():
    x = self.model.backbone(image)
    x = self.model.encoder(x)
    # get the spatial shapes for tensor decomposition
    _, spatial_shapes = self.model.decoder._get_encoder_input(x)
    _ = self.model.decoder(x, goals)
```

**Hook**

```python
def register_hooks(self):
    for name, module in self.model.named_modules():
        if 'cross_attn' in name:
            module.register_forward_hook(self.get_attention_hook)
```
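The `get_attention_hook` callback referenced above isn't shown; here is a minimal, self-contained sketch of how such a collector could look. The class and attribute names (`AttentionCollector`, `attention_maps`) are my own assumptions, not part of RT-DETR, and the tiny stand-in model only exists to make the example runnable:

```python
import torch
import torch.nn as nn

class AttentionCollector:
    """Hypothetical helper: stores the output of every module whose
    name contains 'cross_attn'. Names here are assumptions, not RT-DETR API."""

    def __init__(self, model):
        self.model = model
        self.attention_maps = []  # one entry per hooked forward call
        self.handles = []

    def get_attention_hook(self, module, inputs, output):
        # Detach and move to CPU so stored tensors don't hold the graph or GPU memory.
        self.attention_maps.append(output.detach().cpu())

    def register_hooks(self):
        for name, module in self.model.named_modules():
            if 'cross_attn' in name:
                self.handles.append(
                    module.register_forward_hook(self.get_attention_hook))

    def remove_hooks(self):
        # Always remove handles when done, or the hooks fire on every forward.
        for h in self.handles:
            h.remove()
        self.handles.clear()

# Tiny stand-in model: the submodule name contains 'cross_attn',
# so the hook fires once per forward pass.
model = nn.Sequential()
model.add_module('cross_attn', nn.Linear(8, 8))

collector = AttentionCollector(model)
collector.register_hooks()
with torch.no_grad():
    model(torch.randn(2, 8))
collector.remove_hooks()
print(len(collector.attention_maps))  # 1
```

Removing the handles after inference matters; otherwise the list keeps growing across batches.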

For each inference with a batch of one image, I get 4 groups of attention layers (matching RTDETRTransformerv2::num_layers). The first tensor of the first layer can be decomposed with num_levels (RTDETRTransformerv2::num_levels).
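Given the `spatial_shapes` returned by `_get_encoder_input`, a tensor flattened over all feature levels can be split back into per-level 2-D maps with `torch.split`. A sketch under assumed shapes: I assume a `(batch, sum(h*w), channels)` layout and illustrate with the three levels a 640x640 input would produce at strides 8/16/32; `split_by_levels` is a hypothetical helper, not part of the repo:

```python
import torch

def split_by_levels(flat, spatial_shapes):
    """Split a tensor flattened over all feature levels into per-level maps.

    flat: (batch, sum(h_i * w_i), channels) -- assumed layout.
    spatial_shapes: list of (h_i, w_i) pairs, one per level.
    Returns a list of (batch, h_i, w_i, channels) tensors.
    """
    sizes = [h * w for h, w in spatial_shapes]
    chunks = torch.split(flat, sizes, dim=1)
    return [c.reshape(c.shape[0], h, w, -1)
            for c, (h, w) in zip(chunks, spatial_shapes)]

# Example: three levels for a 640x640 input with strides 8/16/32.
shapes = [(80, 80), (40, 40), (20, 20)]
total = sum(h * w for h, w in shapes)  # 8400 flattened positions
flat = torch.randn(1, total, 256)
maps = split_by_levels(flat, shapes)
print([tuple(m.shape) for m in maps])
# [(1, 80, 80, 256), (1, 40, 40, 256), (1, 20, 20, 256)]
```

Each entry of `maps` can then be upsampled to the input resolution and overlaid on the image as a heatmap.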

I have new questions about this

Thanks for your advice S.