jeonsworld / ViT-pytorch

Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
MIT License
1.94k stars, 370 forks

Why is the addition of convolution useless? #38

Open haodonga opened 3 years ago

haodonga commented 3 years ago

I applied data augmentation (translation, rotation, and scaling) to the test samples, hoping to benefit from the inductive bias of the CNN, but R50+ViT did not achieve the expected improvement. Under what circumstances is R50+ViT better than plain ViT?
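One way to make the comparison described above concrete is to measure, for each model, accuracy on the clean test set and on the perturbed test set, then compare the drop. The following is a minimal, framework-agnostic sketch, not code from this repository. The names `accuracy`, `robustness_gap`, `shift_right`, `position_sensitive`, and `shift_invariant` are all hypothetical, and the toy "images" and "models" are stand-ins chosen only so the example runs without any dependencies.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

Image = TypeVar("Image")
Sample = Tuple[Image, int]


def accuracy(
    predict: Callable[[Image], int],
    samples: Iterable[Sample],
    perturb: Optional[Callable[[Image], Image]] = None,
) -> float:
    """Fraction of samples classified correctly, optionally after perturbing each image."""
    correct = 0
    total = 0
    for image, label in samples:
        if perturb is not None:
            image = perturb(image)
        correct += int(predict(image) == label)
        total += 1
    if total == 0:
        raise ValueError("samples is empty")
    return correct / total


def robustness_gap(
    predict: Callable[[Image], int],
    samples: Iterable[Sample],
    perturb: Callable[[Image], Image],
) -> Tuple[float, float, float]:
    """Return (clean accuracy, perturbed accuracy, clean minus perturbed)."""
    samples = list(samples)  # allow two passes over a one-shot iterable
    clean = accuracy(predict, samples)
    perturbed = accuracy(predict, samples, perturb)
    return clean, perturbed, clean - perturbed


# --- Toy stand-ins (assumptions for illustration only) ---

def shift_right(image: tuple) -> tuple:
    """Cyclic one-pixel translation of a 1-D 'image'."""
    return image[-1:] + image[:-1]


def position_sensitive(image: tuple) -> int:
    """Stand-in for a model that relies on absolute position."""
    return image[0]


def shift_invariant(image: tuple) -> int:
    """Stand-in for a model whose output does not change under translation."""
    return max(image)


toy_samples = [((1, 0, 0, 0), 1), ((0, 0, 0, 0), 0),
               ((1, 0, 0, 0), 1), ((0, 0, 0, 0), 0)]

gap_sensitive = robustness_gap(position_sensitive, toy_samples, shift_right)
gap_invariant = robustness_gap(shift_invariant, toy_samples, shift_right)
```

In a real comparison, `predict` would wrap the R50+ViT or ViT forward pass plus an argmax, and `perturb` would be the translation, rotation, or scaling applied to the test images. If both models show a similar gap, the perturbations used may simply not be strong enough to separate them.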