Hi @Zhong1015, the shapes of Wq, Wk, and Wv are all (dim, dim). The 1x1 annotation denotes a 1x1 convolution on the 2-D feature map, under which each token is projected independently.
We followed Figure 2 of the Non-local Neural Networks paper (https://arxiv.org/pdf/1711.07971.pdf).
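Here is a minimal sketch of that equivalence (the `dim` value and feature-map size below are illustrative, not taken from the repo): a 1x1 convolution with a (dim, dim) kernel gives the same result as applying a (dim, dim) linear projection to every token independently.

```python
import torch
import torch.nn as nn

dim = 64                                   # hypothetical embedding dimension
x = torch.randn(1, dim, 14, 14)            # a 2-D feature map: (B, C, H, W)

# A 1x1 convolution with weight of shape (dim, dim, 1, 1) ...
conv = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

# ... matches a (dim, dim) linear projection applied per token,
# when both layers share the same weights.
linear = nn.Linear(dim, dim, bias=False)
linear.weight.data = conv.weight.data.view(dim, dim)

out_conv = conv(x)                                  # (B, C, H, W)
tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C): one row per token
out_linear = linear(tokens).transpose(1, 2).view_as(out_conv)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```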
Thank you for your reply! This helps me a lot!
I read the paper you referenced. It mentions that the 1x1 convolutions are used for dimensionality reduction. However, when using iRPE in practice, following the standard Transformer pipeline, the input must first undergo a linear transformation to obtain Q, K, and V (query, key, value). Is this linear transformation actually performed in Figure 1, even though it is not explicitly shown there?
@Zhong1015

> Is this linear transformation actually performed in Figure 1?

Yes. The linear transformation is the same as in standard self-attention; see the implementation here:
https://github.com/microsoft/Cream/blob/main/iRPE/DeiT-with-iRPE/rpe_vision_transformer.py#L70
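For context, here is a simplified sketch of that projection as it appears in standard DeiT/timm-style multi-head self-attention. All shapes and hyperparameters below are illustrative, and the relative position bias that iRPE adds to the attention scores is omitted; see the linked file for the actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
B, N, dim, num_heads = 2, 197, 192, 3
head_dim = dim // num_heads

x = torch.randn(B, N, dim)                # token embeddings: (batch, tokens, dim)

# One fused linear layer produces Q, K, and V, i.e. three (dim, dim) projections.
qkv = nn.Linear(dim, dim * 3, bias=True)

out = qkv(x)                              # (B, N, 3 * dim)
out = out.reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
q, k, v = out[0], out[1], out[2]          # each: (B, num_heads, N, head_dim)

attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5   # scaled dot-product scores
attn = attn.softmax(dim=-1)
y = (attn @ v).transpose(1, 2).reshape(B, N, dim)     # back to (B, N, dim)
```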
@wkcn Hello, your work is inspiring. Recently, while looking at Figure 1, I became a bit confused. You annotate '1x1' after Wq, Wk, and Wv, which suggests you are using weight matrices of size 1x1. However, this contradicts my previous understanding. Another interpretation is that 1x1 indicates the input x is first reduced in dimensionality and then passed through the Wq, Wk, and Wv matrices. I would like to know whether my understanding of this part is correct.