Hi @Zhong1015, thank you for your continued interest in iRPE :)
Please see rpe_config here:
https://github.com/microsoft/Cream/blob/main/iRPE/DETR-with-iRPE/models/transformer.py#L75-L77

Thank you for your reply. Based on it, my understanding is that adding relative positional encoding to Q (Query) and K (Key) has the more significant impact: in the self-attention mechanism, the similarity between Query and Key determines the weighting of the attention distribution, whereas relative positional encoding added only to the Value does not directly participate in computing the attention weights, so on its own it does not help much. Furthermore, there is no need to add relative positional encoding to Q and K in the cross-attention part, because: ① the Query consists of learnable label embeddings while the Key carries image feature information, so there is no meaningful positional correlation between them; ② applying relative positional encoding only to the Values in cross-attention would not bring a significant benefit either. Are my understandings correct? Looking forward to your reply.
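To make my understanding concrete, here is a minimal toy sketch (my own illustration, not your implementation) of where the three relative terms enter a single self-attention head in contextual mode; `rpe_q`, `rpe_k` and `rpe_v` are hypothetical per-offset tables for a 1-D sequence:

```python
import torch
import torch.nn.functional as F

L, d = 5, 8                            # toy sequence length and head dimension
q, k, v = (torch.randn(L, d) for _ in range(3))

# one learnable vector per relative offset in [-(L-1), L-1] (hypothetical tables)
rpe_q = torch.randn(2 * L - 1, d)      # interacts with K  (RPE on Q)
rpe_k = torch.randn(2 * L - 1, d)      # interacts with Q  (RPE on K)
rpe_v = torch.randn(2 * L - 1, d)      # added after the softmax (RPE on V)

# relative offsets i - j, shifted to valid table indices [0, 2L-2]
idx = torch.arange(L)[:, None] - torch.arange(L)[None, :] + (L - 1)

logits = q @ k.t()                                   # content-content term
logits += (q[:, None, :] * rpe_k[idx]).sum(-1)       # RPE on K: q_i · r_{i-j}
logits += (k[None, :, :] * rpe_q[idx]).sum(-1)       # RPE on Q: k_j · r_{i-j}
attn = F.softmax(logits / d ** 0.5, dim=-1)          # the Q/K terms reshape the weights

out = attn @ v                                       # content output
out += (attn[:, :, None] * rpe_v[idx]).sum(1)        # RPE on V: added after the softmax,
                                                     # so it cannot change the weights
```

If this matches the real computation, it also illustrates why relative encoding on V alone leaves the attention distribution untouched.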
I have represented the implementation of relative positional encoding with a sketch. Is this the process when iRPE is applied to Q, K and V simultaneously (contextual mode)?
Sorry for the late reply. Your understandings are correct.
Great work! It is notable that the values of RPE-Q, RPE-K and RPE-V change with the relative position.
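As a toy illustration (my paraphrase, not the exact repository code), each relative offset selects its own row of the learned table, and distant offsets are merged into coarser buckets by the piecewise function described in the paper; the `alpha`, `beta`, `gamma` values below are only illustrative:

```python
import math
import torch

def bucket(rel, alpha=8, beta=16, gamma=32):
    """Piecewise mapping of a signed relative distance to a bucket index."""
    if abs(rel) <= alpha:
        return int(rel)
    sign = 1 if rel > 0 else -1
    idx = alpha + math.log(abs(rel) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    return sign * min(beta, round(idx))

beta = 16
table = torch.randn(2 * beta + 1, 8)   # one (toy) learned row per bucket in [-beta, beta]

for rel in (1, 3, 9, 40, 200):
    b = bucket(rel)
    # nearby offsets keep their own rows; far-away offsets are clipped into coarser buckets
    print(f"offset {rel:>3} -> bucket {b:>2}")
```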
Thanks for your reply!
Hello, I am still following your excellent work. Currently, I have two questions:
① I am considering applying iRPE to DETR, and I am curious why you do not use iRPE on the key-value (KV) pairs in the cross-attention part. I understand that the query in cross-attention consists of learnable embedding vectors, which is why you did not apply iRPE to the query there, and that you did not apply contextual-mode iRPE to the key (since the contextual mode on the key requires information from the query). In that case, why have you not tried applying iRPE to the key in cross-attention with the bias approach? (A toy sketch of what I mean follows these two questions.)
② The iRPE parameters are defined inside the Transformer class. How do you use different settings for the encoder and the decoder?
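For question ①, here is a minimal toy sketch (my own illustration, not your code) of what I mean by the bias approach on the key: the relative term added to the logits is a learned scalar looked up purely by the relative offset, so it needs no information from the query content:

```python
import torch
import torch.nn.functional as F

Lq, Lk, d = 4, 6, 8
q = torch.randn(Lq, d)                 # in cross-attention: object-query content
k = torch.randn(Lk, d)                 # in cross-attention: image-feature keys
bias_table = torch.randn(2 * Lk - 1)   # one learned scalar per relative offset (toy 1-D case)

# hypothetical relative offsets between query slots and key positions; in DETR's
# decoder the object queries carry no spatial coordinate, which is exactly what
# makes such an offset hard to define in cross-attention.
idx = torch.arange(Lq)[:, None] - torch.arange(Lk)[None, :] + (Lk - 1)

logits = q @ k.t() / d ** 0.5 + bias_table[idx]   # bias mode: content-independent term
attn = F.softmax(logits, dim=-1)
```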
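For question ②, here is how I imagine it could be done (a hypothetical sketch with made-up class names, not your actual transformer.py): build one config per stack and hand each stack its own settings when the layers are constructed.

```python
def make_rpe_config(method="product", mode="ctx", shared_head=True, rpe_on="k"):
    """Toy stand-in for a config factory; the keys mirror options I have seen in iRPE."""
    return {"method": method, "mode": mode, "shared_head": shared_head, "rpe_on": rpe_on}

# independent settings for the two stacks
enc_rpe_config = make_rpe_config(rpe_on="qk")   # e.g. RPE on Q and K in the encoder
dec_rpe_config = make_rpe_config(rpe_on="v")    # e.g. RPE only on V in the decoder

class EncoderLayer:                  # hypothetical stand-ins for the real layer classes
    def __init__(self, rpe_config):
        self.rpe_config = rpe_config     # the layer's self-attention would read this

class DecoderLayer:
    def __init__(self, rpe_config):
        self.rpe_config = rpe_config

encoder = [EncoderLayer(enc_rpe_config) for _ in range(6)]
decoder = [DecoderLayer(dec_rpe_config) for _ in range(6)]
```

Is that roughly how you separate them, or do the encoder and decoder share a single rpe_config in the current code?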