microsoft / CvT

This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.
MIT License

code mismatch with the theory #9

Open basavaraj-hampiholi opened 2 years ago

basavaraj-hampiholi commented 2 years ago

Hi All,

Thanks for providing the code. I came across a mismatch between the code and the theory you proposed for the transformer block. The paper says "Instead, we propose to replace the original position-wise linear projection for Multi-Head Self-Attention (MHSA)", but lines 198-200 in https://github.com/leoxiaobin/CvT/blob/main/lib/models/cls_cvt.py still project q, k, v through linear layers. Have you missed an else statement there? Why are you projecting the q, k, v values twice?
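To make the question concrete, this is roughly the structure I mean (a simplified paraphrase written from memory, not a verbatim copy of cls_cvt.py):

```python
import torch
import torch.nn as nn
from einops import rearrange

class AttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=1, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        # the "convolutional projection": depthwise conv + BN on the 2D feature map
        self.conv_proj_q = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                      groups=dim, bias=False),
            nn.BatchNorm2d(dim),
        )
        # ...conv_proj_k / conv_proj_v follow the same pattern (possibly strided)...

        # and then, on top of that, the position-wise linear projection
        # (the lines 198-200 my question is about)
        self.proj_q = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) token sequence
        x2d = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)
        q = rearrange(self.conv_proj_q(x2d), 'b c h w -> b (h w) c')
        q = self.proj_q(q)  # <- the second (linear) projection of q
        q = rearrange(q, 'b t (h d) -> b h t d', h=self.num_heads)
        return q            # k and v go through the same two steps
```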

Please correct me if I have misunderstood it.

Thanks, Basavaraj

askerlee commented 2 years ago

This is after conv_proj_q, conv_proj_k and conv_proj_v. But I'm not sure why the authors still use the pointwise projections after the conv projections.

basavaraj-hampiholi commented 2 years ago

@askerlee, I think it is part of depthwise separable convolutions. Depthwise convolutions followed by pointwise projections.
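i.e., something like this (a generic depthwise-separable sketch, not the exact CvT code):

```python
import torch
import torch.nn as nn

dim = 64
x = torch.randn(2, dim, 14, 14)

# depthwise part: one kernel per channel (groups=dim), mixes only spatial positions
depthwise = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim, bias=False)

# pointwise part: 1x1 conv (equivalently, a per-token linear layer), mixes channels
pointwise = nn.Conv2d(dim, dim, kernel_size=1, bias=False)

y = pointwise(depthwise(x))   # depthwise-separable convolution
print(y.shape)                # torch.Size([2, 64, 14, 14])
```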

diaodeyi commented 2 years ago

I'd like to know how to call the get_cls_model function in cls_cvt.py.

basavaraj-hampiholi commented 2 years ago

Hi @diaodeyi, in the present code the get_cls_model function is called through registry.py. You can use build_model in build.py to build the model. Otherwise, you can remove the registry and call the get_cls_model function directly. Both ways should work.
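Roughly, the two ways look like this (I'm writing the import paths and config handling from memory, so double-check the names against the repo):

```python
# usage sketch -- assumes the repo's lib/ directory is on your PYTHONPATH and
# 'config' is the yacs experiment config loaded from a yaml file under experiments/

def build_cvt(config):
    # Way 1: through the registry -- build_model looks up the entry point that
    # registry.py registered under the model name in the config and calls it.
    from models.build import build_model
    model_via_registry = build_model(config)

    # Way 2: bypass the registry and call the entry point directly.
    from models.cls_cvt import get_cls_model
    model_direct = get_cls_model(config)

    # Both should give you the same model instance.
    return model_direct
```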

Good luck..

diaodeyi commented 2 years ago

By the way, does self.proj = nn.Linear(dim_out, dim_out) mean the FFN is only a projection with the same dimension?

basavaraj-hampiholi commented 2 years ago

@diaodeyi It's a single linear layer (with the same in/out dimension) right after the attention calculation, i.e. the attention output projection. The FFN in this code is class MLP (line 53).
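In other words, a standard transformer block has both, roughly like this (a generic sketch, not the CvT code itself):

```python
import torch
import torch.nn as nn

class AttentionWithOutputProj(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # stand-in for the q/k/v projections
        self.proj = nn.Linear(dim, dim)      # <- the layer you asked about:
                                             #    output projection after attention

    def forward(self, x):
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * (c // self.num_heads) ** -0.5
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(x)                           # same in/out dimension

class MLP(nn.Module):   # this is the actual FFN (class MLP in the code)
    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * hidden_ratio)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * hidden_ratio, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```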

diaodeyi commented 2 years ago

Thanks. There are so many linear projections that aren't mentioned in the paper.

basavaraj-hampiholi commented 2 years ago

@diaodeyi Yes. I think they left them out with the presumption that the reader already has a good understanding of the basic transformer architecture.

diaodeyi commented 2 years ago

> @askerlee, I think it is part of depthwise separable convolutions. Depthwise convolutions followed by pointwise projections.

No, I think proj_q/k/v are exactly the things the paper does not mention.

Markin-Wang commented 1 year ago

> @askerlee, I think it is part of depthwise separable convolutions. Depthwise convolutions followed by pointwise projections.
>
> No, I think proj_q/k/v are exactly the things the paper does not mention.

Hi, the separable depthwise conv contains two parts: a depth-wise conv and a point-wise conv. The author implemented the point-wise conv via a linear layer, maybe because it's convenient for the ablation study. The only difference between them is the bias term.
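You can sanity-check that a 1x1 (point-wise) conv and a linear layer compute the same thing once the weights are shared, e.g. (a generic check, not CvT code):

```python
import torch
import torch.nn as nn

dim, h, w = 64, 14, 14
x2d = torch.randn(2, dim, h, w)                       # (B, C, H, W) feature map
x_seq = x2d.flatten(2).transpose(1, 2)                # (B, H*W, C) token sequence

pointwise = nn.Conv2d(dim, dim, kernel_size=1, bias=True)
linear = nn.Linear(dim, dim, bias=True)

# share the same weights/bias between the two layers
with torch.no_grad():
    linear.weight.copy_(pointwise.weight.squeeze(-1).squeeze(-1))
    linear.bias.copy_(pointwise.bias)

y_conv = pointwise(x2d).flatten(2).transpose(1, 2)    # 1x1 conv, then flatten to tokens
y_lin = linear(x_seq)                                 # linear layer on the token sequence

print(torch.allclose(y_conv, y_lin, atol=1e-5))       # True: same operation, different layout
```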