microsoft / Cream

This is a collection of our NAS and Vision Transformer work.

Can iRPE be applied on DETR? #230

Closed · Zhong1015 closed 7 months ago

Zhong1015 commented 8 months ago

Hello, I am still following your excellent work. Currently, I have two questions:

① I am considering applying iRPE to DETR. I am curious why you do not use iRPE on the key-value (kv) pairs in the cross-attention part. I understand that the query in cross-attention consists of learnable embedding vectors, which is why you did not apply iRPE on the cross-attention query, and that you did not apply iRPE in a contextual manner on the key (since contextual iRPE on the key requires information from the query). In that case, why haven't you tried applying iRPE in bias mode on the key in the cross-attention part?

② The iRPE parameters are defined within the Transformer class. How do you use different settings for the encoder and the decoder?

wkcn commented 8 months ago

Hi @Zhong1015, thank you for your continued interest in iRPE :)

  1. I think there is no relative positional relationship between the query and key sequences in the cross-attention part.
  2. iRPE is only employed on the self-attention layers of the transformer encoder, by passing the argument rpe_config (a minimal sketch of this plumbing follows below). https://github.com/microsoft/Cream/blob/main/iRPE/DETR-with-iRPE/models/transformer.py#L75-L77
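
Here is a minimal sketch of that plumbing, assuming toy class names (`ToyTransformer`, `ToyEncoderLayer`, `ToyDecoderLayer`) rather than the repository's actual modules: the `rpe_config` argument is consumed only by the encoder self-attention layers, while the decoder's self- and cross-attention stay plain. See the linked `transformer.py` for the real wiring.

```python
# A minimal sketch (toy names, not the repository's exact API) of how rpe_config can be
# routed only into the encoder self-attention, while the decoder keeps plain attention.
import torch
import torch.nn as nn


class ToyEncoderLayer(nn.Module):
    """Encoder layer whose self-attention would be built from rpe_config."""

    def __init__(self, d_model, nhead, rpe_config=None):
        super().__init__()
        # A real implementation would construct an RPE-aware attention module from
        # rpe_config (relative-position lookup tables per head); here we only record it
        # to show where the argument lands.
        self.rpe_config = rpe_config
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.self_attn(x, x, x)[0])


class ToyDecoderLayer(nn.Module):
    """Decoder layer: plain self-attention and cross-attention, no rpe_config."""

    def __init__(self, d_model, nhead):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory):
        tgt = self.norm1(tgt + self.self_attn(tgt, tgt, tgt)[0])
        return self.norm2(tgt + self.cross_attn(tgt, memory, memory)[0])


class ToyTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2, rpe_config=None):
        super().__init__()
        # rpe_config is consumed only by the encoder layers.
        self.encoder = nn.ModuleList(
            [ToyEncoderLayer(d_model, nhead, rpe_config) for _ in range(num_layers)]
        )
        self.decoder = nn.ModuleList(
            [ToyDecoderLayer(d_model, nhead) for _ in range(num_layers)]
        )

    def forward(self, src, query_embed):
        memory = src
        for layer in self.encoder:
            memory = layer(memory)
        tgt = query_embed.repeat(src.size(0), 1, 1)
        for layer in self.decoder:
            tgt = layer(tgt, memory)
        return tgt


# Usage: only the encoder "sees" the RPE settings.
model = ToyTransformer(rpe_config={"mode": "ctx", "rpe_on": "k"})
out = model(torch.randn(2, 100, 256), torch.zeros(1, 50, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```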
Zhong1015 commented 8 months ago

Thank you for your reply. Based on it, my understanding is that adding relative positional encoding to Q (Query) and K (Key) has the more significant impact, because in self-attention the similarity between Query and Key determines the attention weights; adding relative positional encoding only to the Value makes less sense, since the Value does not directly participate in computing the attention weights. In the cross-attention part, there is likewise no need to add relative positional encoding to Q and K, because: ① the query consists of learnable label embeddings while the key carries image feature information, so there is no meaningful positional correlation between them; ② applying relative positional encoding only to the values in cross-attention would not yield a significant benefit either. Are my understandings correct? Looking forward to your reply.
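
To make the contrast concrete, here is an illustrative sketch (not code from the repository; the tensor names and the bias table are hypothetical): a relative-position term on the Q/K side changes the attention weights themselves, whereas a term added only on the V side leaves the weights untouched and merely shifts the aggregated output.

```python
import torch
import torch.nn.functional as F

L, d = 5, 8                                    # sequence length, head dimension
q, k, v = (torch.randn(L, d) for _ in range(3))

# Bias-mode term on the Q/K side: a scalar indexed by the relative position (i - j)
# is added to the attention logits, so the softmax distribution itself changes.
rel_bias_table = torch.randn(2 * L - 1)        # one learnable scalar per relative distance
idx = torch.arange(L)[:, None] - torch.arange(L)[None, :] + (L - 1)
logits = q @ k.t() / d ** 0.5 + rel_bias_table[idx]
attn_qk_rpe = F.softmax(logits, dim=-1)

# Term on the V side only: the attention weights are computed without any positional
# information and stay unchanged; the encoding is mixed in after the weights are decided.
attn_plain = F.softmax(q @ k.t() / d ** 0.5, dim=-1)
rel_v = torch.randn(2 * L - 1, d)              # one learnable vector per relative distance
out_v_rpe = attn_plain @ v + (attn_plain.unsqueeze(-1) * rel_v[idx]).sum(dim=1)

# The first row of each weight matrix: only the Q/K-side term reshapes the distribution.
print(attn_qk_rpe[0])
print(attn_plain[0])
```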

Zhong1015 commented 8 months ago

[Screenshot 2024-03-25 224826]

The screenshot above is my sketch of how the relative positional encoding is computed. Is this the process when iRPE is applied to q, k and v simultaneously (contextual mode)?

wkcn commented 7 months ago

> Thank you for your reply. Based on it, my understanding is that adding relative positional encoding to Q (Query) and K (Key) has the more significant impact, because in self-attention the similarity between Query and Key determines the attention weights; adding relative positional encoding only to the Value makes less sense, since the Value does not directly participate in computing the attention weights. In the cross-attention part, there is likewise no need to add relative positional encoding to Q and K, because: ① the query consists of learnable label embeddings while the key carries image feature information, so there is no meaningful positional correlation between them; ② applying relative positional encoding only to the values in cross-attention would not yield a significant benefit either. Are my understandings correct? Looking forward to your reply.

Sorry for the late reply. Your understanding is correct.

wkcn commented 7 months ago

> [Screenshot 2024-03-25 224826] The screenshot above is my sketch of how the relative positional encoding is computed. Is this the process when iRPE is applied to q, k and v simultaneously (contextual mode)?

Great work! Note that the values of RPE-Q, RPE-K and RPE-V change with the relative position.

[image]
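
Complementing the confirmation above, here is a rough sketch of contextual-mode relative encodings applied to q, k and v in a single head. The formulation and names (`rpe_q`, `rpe_k`, `rpe_v`) are illustrative assumptions, not the repository's implementation: each pair (i, j) looks up a learnable vector by its relative distance (so the RPE-Q/K/V terms indeed vary with the relative position), and, unlike bias mode, the Q- and K-side terms also interact with the content of q and k, while the V-side term is added inside the weighted sum.

```python
import torch
import torch.nn.functional as F

L, d = 5, 8
q, k, v = (torch.randn(L, d) for _ in range(3))

# One learnable vector per relative distance, for each of Q, K and V.
num_rel = 2 * L - 1
rpe_q = torch.randn(num_rel, d)
rpe_k = torch.randn(num_rel, d)
rpe_v = torch.randn(num_rel, d)
idx = torch.arange(L)[:, None] - torch.arange(L)[None, :] + (L - 1)   # (L, L) relative indices

# Contextual terms on the logits: they depend on both the relative position and the content.
logits = q @ k.t()
logits = logits + (q.unsqueeze(1) * rpe_k[idx]).sum(-1)   # K-side term: q_i · r^K_{ij}
logits = logits + (k.unsqueeze(0) * rpe_q[idx]).sum(-1)   # Q-side term: k_j · r^Q_{ij}
attn = F.softmax(logits / d ** 0.5, dim=-1)

# Contextual term on the values: added inside the weighted sum over positions.
out = attn @ v + (attn.unsqueeze(-1) * rpe_v[idx]).sum(dim=1)
print(out.shape)  # torch.Size([5, 8])
```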
Zhong1015 commented 7 months ago

Thanks for your reply!