Catchip opened this issue 1 week ago
We conducted a preliminary investigation (not included in the paper) to determine the optimal $k$ when using the last $k$ self-attention layers. While the differences in results were not substantial, using all layers proved to be the best choice.
We employ a gating mechanism for each layer, allowing us to indirectly assess the importance of each layer through the gate values. The experimental results for this can be found in Supplementary Figure 2. Interestingly, the gate value for the first self-attention layer (prior to any cross-attention layer being applied) was notably high.
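For readers wondering how such per-layer gating might look in code, here is a minimal PyTorch sketch. It is not the authors' implementation; the module name `GatedLayerAggregator`, the tensor shapes, and the scalar-gate-per-layer design are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatedLayerAggregator(nn.Module):
    """Combine per-layer relation features with one learnable gate per layer.

    Assumes `per_layer_rel` stacks features derived from each decoder
    self-attention layer: shape (num_layers, batch, queries, queries, dim).
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar gate per layer; zero init gives
        # sigmoid(0) = 0.5, so all layers start with equal weight.
        self.gate_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, per_layer_rel: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.gate_logits)            # (num_layers,)
        weighted = per_layer_rel * gates.view(-1, 1, 1, 1, 1)
        return weighted.sum(dim=0)                         # (batch, Q, Q, dim)


# Toy usage: 6 layers, batch 2, 100 queries, 256-dim relation features.
agg = GatedLayerAggregator(num_layers=6)
out = agg(torch.randn(6, 2, 100, 100, 256))                # (2, 100, 100, 256)
```

After training, inspecting `torch.sigmoid(agg.gate_logits)` would give the per-layer importance values discussed above, e.g. a high gate for the first self-attention layer.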
The idea of extracting relationships from self-attention weights is indeed very inspiring! However, I have some questions. First, I should admit that my understanding of DETR is not very deep, but as I understand it, the object queries output by the later decoder layers are more accurate; DETR typically regresses the final bounding boxes from the object queries of the last layer. So, in EGTR, why not directly use the self-attention weights from the final layer for relationship extraction? Have you conducted any ablation studies on this?
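To make the question concrete: the ablation I have in mind would replace the gated sum over all layers with only the last layer's map, roughly as in this hypothetical sketch (`per_layer_attn` and its shape are assumptions, not from the paper):

```python
import torch

# Stacked decoder self-attention maps: (num_layers, batch, queries, queries).
per_layer_attn = torch.randn(6, 2, 100, 100)

# Final-layer-only variant: keep just the last layer's self-attention map
# and feed it to the relation predictor instead of the gated aggregate.
final_attn = per_layer_attn[-1]                # (batch, queries, queries)
```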