Hi @Zhong1015, the shapes of Wq, Wk, and Wv are all (dim, dim). The 1x1 annotation denotes a 1x1 convolution on the 2-D feature map, under which each token is projected independently.
We followed Figure 2 of the Non-local Neural Networks paper (https://arxiv.org/pdf/1711.07971.pdf).
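Here is a minimal sketch of that equivalence (the `dim` value and feature-map size below are illustrative, not taken from the repo): a 1x1 convolution with a (dim, dim) kernel gives the same result as applying a (dim, dim) linear projection to every token independently.

```python
import torch
import torch.nn as nn

dim = 64                                   # hypothetical embedding dimension
x = torch.randn(1, dim, 14, 14)            # a 2-D feature map: (B, C, H, W)

# A 1x1 convolution with weight of shape (dim, dim, 1, 1) ...
conv = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

# ... matches a (dim, dim) linear projection applied per token,
# when both layers share the same weights.
linear = nn.Linear(dim, dim, bias=False)
linear.weight.data = conv.weight.data.view(dim, dim)

out_conv = conv(x)                                  # (B, C, H, W)
tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C): one row per token
out_linear = linear(tokens).transpose(1, 2).view_as(out_conv)

print(torch.allclose(out_conv, out_linear, atol=1e-5))  # True
```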
Thank you for your reply! This helps me a lot!
I read the paper you referenced. It mentions that the 1x1 convolutions are used for dimensionality reduction. However, when using iRPE in practice, following the standard Transformer pipeline, the input must first undergo a linear transformation to obtain Q, K, and V (query, key, value). Is this linear transformation actually performed in Figure 1, even though it is not explicitly shown there?
@Zhong1015

> Is this linear transformation actually performed in Figure 1?

Yes. The linear transformation is the same as in standard self-attention; see the implementation here:
https://github.com/microsoft/Cream/blob/main/iRPE/DeiT-with-iRPE/rpe_vision_transformer.py#L70
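For context, here is a simplified sketch of that projection as it appears in standard DeiT/timm-style multi-head self-attention. All shapes and hyperparameters below are illustrative, and the relative position bias that iRPE adds to the attention scores is omitted; see the linked file for the actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
B, N, dim, num_heads = 2, 197, 192, 3
head_dim = dim // num_heads

x = torch.randn(B, N, dim)                # token embeddings: (batch, tokens, dim)

# One fused linear layer produces Q, K, and V, i.e. three (dim, dim) projections.
qkv = nn.Linear(dim, dim * 3, bias=True)

out = qkv(x)                              # (B, N, 3 * dim)
out = out.reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
q, k, v = out[0], out[1], out[2]          # each: (B, num_heads, N, head_dim)

attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5   # scaled dot-product scores
attn = attn.softmax(dim=-1)
y = (attn @ v).transpose(1, 2).reshape(B, N, dim)     # back to (B, N, dim)
```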
@wkcn Hello, your work is inspiring. Recently, while looking at Figure 1, I became a bit confused. You annotate '1x1' after Wq, Wk, and Wv, which suggests you are using weight matrices of size 1x1. However, this contradicts my previous understanding. Another interpretation is that 1x1 indicates the input x is first reduced in dimensionality and then passed through the Wq, Wk, and Wv matrices. I would like to know whether my understanding of this part is correct.