jiugexuan opened this issue 1 month ago
Please refer to the DETR paper. The self-attention of the DETR decoder (not masked) is different from the masked self-attention of the original Transformer decoder:
The difference with the original transformer is that our model decodes the N objects in parallel at each decoder layer,
while Vaswani et al. [47] use an autoregressive model that predicts the output sequence one element at a time.
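A minimal PyTorch sketch of what this means (not taken from either repository; the sizes and tensors are made up for illustration): with no attention mask, every one of the N object queries attends to every other, so all N objects are decoded in parallel; with a causal mask, position i only sees positions up to i, which is what autoregressive decoding relies on.

```python
import torch
import torch.nn as nn

N, d_model, num_heads = 100, 256, 8
object_queries = torch.randn(1, N, d_model)  # stand-in for DETR's N object queries

self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# DETR decoder: no attn_mask, so the N x N attention weights are fully
# bidirectional and all N objects are decoded in parallel.
_, attn_detr = self_attn(object_queries, object_queries, object_queries,
                         need_weights=True, average_attn_weights=False)
print(attn_detr.shape)  # (1, num_heads, N, N)

# Original Transformer decoder: a causal mask (True = "not allowed") blocks
# attention to later positions, enabling one-element-at-a-time decoding.
causal_mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
_, attn_masked = self_attn(object_queries, object_queries, object_queries,
                           attn_mask=causal_mask,
                           need_weights=True, average_attn_weights=False)
# attn_masked[..., i, j] is zero for j > i.
```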
So the q, k used for the relations come from the first attention layer (the self-attention) of the transformer decoder layers? From here?
Yes, that's right.
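In case a concrete picture helps, here is a hedged sketch of pulling the self-attention q and k out of every decoder layer with forward hooks. It uses torch.nn.TransformerDecoder as a stand-in for the DETR decoder, so the module paths and shapes are assumptions, not the actual DETR/EGTR code (which, for example, adds positional/query embeddings before the attention).

```python
import torch
import torch.nn as nn

# Toy stand-in decoder: nn.TransformerDecoder exposes .layers[i].self_attn,
# which is the (unmasked, when no tgt_mask is passed) self-attention in question.
d_model, num_heads, num_layers, N = 256, 8, 6, 100
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True),
    num_layers=num_layers,
)

captured = []  # one (q, k) pair per decoder layer

def hook(module, inputs, output):
    # For nn.MultiheadAttention, inputs[0] / inputs[1] are the query and key
    # tensors fed to this layer's self-attention.
    q, k = inputs[0], inputs[1]
    captured.append((q.detach(), k.detach()))

handles = [layer.self_attn.register_forward_hook(hook) for layer in decoder.layers]

object_queries = torch.randn(1, N, d_model)   # stand-in for DETR's N object queries
encoder_memory = torch.randn(1, 50, d_model)  # stand-in for the encoder output
decoder(object_queries, encoder_memory)       # no tgt_mask -> unmasked self-attention

for h in handles:
    h.remove()

# `captured` now holds the self-attention q, k of each of the L decoder layers;
# these are the tensors the relation extractor reads as subjects (q) and objects (k).
```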
In the paper (a small sketch of this passage follows after the question below):
We propose a novel lightweight relation extractor, EGTR, which exploits the self-attention of DETR decoder, as depicted in Fig. 3. Since the self-attention weights in Eq. (1) contain N × N bidirectional relationships among the N object queries, our relation extractor aims to extract the predicate information from the self-attention weights in the entire L layers, by considering the attention queries and keys as subjects and objects, respectively.
Is the self-attention of the DETR decoder the same as the masked multi-head attention layer in the Transformer decoder?
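For intuition on the quoted passage: per layer, the self-attention weights softmax(qkᵀ/√d) form an N × N map, and stacking the L layers gives an N × N × L tensor whose [i, j] entry is read as (subject i, object j). Below is a minimal sketch of just that bookkeeping, with made-up shapes; it is not the actual EGTR relation head, which does more than this.

```python
import math
import torch

L, N, d = 6, 100, 256
qs = torch.randn(L, N, d)  # stand-ins for the attention queries of each layer
ks = torch.randn(L, N, d)  # stand-ins for the attention keys of each layer

# Per-layer self-attention weights: (L, N, N)
attn = torch.softmax(qs @ ks.transpose(1, 2) / math.sqrt(d), dim=-1)

# Stack layers last: (N, N, L). relation_source[i, j] collects, across all
# L layers, how strongly query i (subject) attended to key j (object).
relation_source = attn.permute(1, 2, 0)
print(relation_source.shape)  # torch.Size([100, 100, 6])
```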