naver-ai / egtr

[CVPR 2024 Best paper award candidate] EGTR: Extracting Graph from Transformer for Scene Graph Generation
https://arxiv.org/abs/2404.02072
Apache License 2.0

Why not just use final-layer self-attention weights for relationship extraction in EGTR #13

Open Catchip opened 1 week ago

Catchip commented 1 week ago

The idea of extracting relationships from self-attention weights is indeed very inspiring! However, I have some questions. I should first clarify that my understanding of DETR is not very deep, but as I understand it, the object queries output by the later layers are more accurate: DETR typically regresses the final bounding boxes from the last layer's object queries. So, in EGTR, why don't you directly use the self-attention weights from the final layer for relationship extraction? Have you conducted any ablation studies on this?

jinbae commented 1 day ago

We conducted a preliminary investigation to determine the optimal $k$ when using only the last $k$ self-attention layers (not included in the paper). While the differences in results were not substantial, using all layers proved to be the best choice.

We employ a gating mechanism for each layer, which lets us indirectly assess each layer's importance through its gate value. The experimental results can be found in Supplementary Figure 2. Interestingly, the gate value for the first self-attention layer (before any cross-attention layer is applied) was notably high.
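For intuition, here is a minimal numpy sketch of the general idea of gating per-layer self-attention maps: each decoder layer's query-to-query attention map gets a learnable sigmoid gate, and the gated maps are combined into relation scores. The function name, the plain weighted average, and the scalar-per-layer gates are illustrative assumptions; EGTR's actual relation head is more elaborate than this.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_relation_scores(attn_maps, gate_logits):
    """Combine per-layer self-attention maps with per-layer gates.

    attn_maps:   list of L arrays, each (N, N) -- self-attention weights
                 among the N object queries at one decoder layer.
    gate_logits: array of L scalars -- learnable gate parameters, one per
                 layer. After training, sigmoid(gate_logits[l]) indicates
                 how much layer l contributes (the kind of signal reported
                 in Supplementary Figure 2).
    Returns a (N, N) gated average of the layer attention maps.
    (Hypothetical sketch; not the actual EGTR relation head.)
    """
    gates = sigmoid(np.asarray(gate_logits, dtype=np.float64))
    combined = sum(g * a for g, a in zip(gates, attn_maps))
    return combined / gates.sum()
```

With extreme gate logits the combination collapses toward a single layer's map, which is how inspecting the gates reveals which layers the relation extractor actually relies on.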