huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CvT: Convolution based Image Transformers #13158

Closed AnugunjNaman closed 2 years ago

AnugunjNaman commented 3 years ago

🌟 New model addition

Model description

A new architecture, named the Convolutional vision Transformer (CvT), improves Vision Transformers (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (e.g. shift, scale, and distortion invariance) while maintaining the merits of Transformers (e.g. dynamic attention, global context, and better generalization).
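To make the two modifications concrete, here is a minimal PyTorch sketch of a convolutional token embedding and a convolutional projection. This is not the authors' reference code; module names, kernel sizes, and strides are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Tokenize an image (or a token map from an earlier stage) with a
    strided convolution, producing a spatially downsampled token grid."""
    def __init__(self, in_channels=3, embed_dim=64, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=kernel_size, stride=stride,
                              padding=kernel_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H', W')
        b, d, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, H'*W', D) token sequence
        return self.norm(x), (h, w)

class ConvProjection(nn.Module):
    """Depthwise-separable convolutional projection that replaces the plain
    linear projection when forming Q/K/V inside the attention block."""
    def __init__(self, dim, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=dim,
                                   bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, tokens, hw):           # tokens: (B, N, D)
        b, n, d = tokens.shape
        h, w = hw
        x = tokens.transpose(1, 2).reshape(b, d, h, w)  # back to a 2D grid
        x = self.pointwise(self.bn(self.depthwise(x)))
        return x.flatten(2).transpose(1, 2)  # (B, N', D)
```

Because tokenization and projection are both convolutions over a 2D grid, spatial structure is carried implicitly, which is what lets the design drop explicit position embeddings.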

Open source status

AnugunjNaman commented 3 years ago

I would like to work on this, @LysandreJik, if you feel it's a nice addition.

NielsRogge commented 3 years ago

Great suggestion! How is this model different from Facebook AI's ConViT?

Currently, we have ViT, DeiT and BEiT in the library. It would be cool to have a Vision Transformer with convolutional inductive biases in the library, as it's probably better in terms of sample efficiency/FLOPS. Perhaps you can compare CvT and ConViT, and add the best of the two to the library? I can help you if you want (I've contributed the aforementioned ones 😉 ).

AnugunjNaman commented 3 years ago

@NielsRogge yeah, sure, any help is welcome. I haven't read ConViT in depth, but skimming through it, they attempt something similar with convolutions. CvT, by contrast, uses pure convolutions, and the architecture eliminates the need for positional embeddings, simplifying the design for vision tasks with variable input resolution. Positional embeddings are often realized as fixed-length learnable vectors, which limits a trained model's adaptation to variable-length inputs. CvT also looks like a strong architecture on the metrics. Your thoughts? If you agree, I can move forward with your help, since this is my first contribution here.

NielsRogge commented 3 years ago

> Positional embeddings are often realized as fixed-length learnable vectors, which limits a trained model's adaptation to variable-length inputs.

Yeah, indeed: models like ViT and BEiT require interpolation of the pre-trained position embeddings when fine-tuning at a different resolution, which is a pain.
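For context, here is a sketch of the interpolation step being referred to: resizing a ViT-style learned 2D position-embedding grid so a model pre-trained at one resolution can be fine-tuned at another. Function and variable names are illustrative, not the library's API.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, 1 + old_grid*old_grid, dim), with a leading [CLS] slot."""
    cls_token, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # Reshape the flat patch embeddings back into a 2D grid of vectors.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # Resample the grid to the new resolution.
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_token, patch_pos], dim=1)

# e.g. going from 224px / patch 16 (14x14 grid) to 384px / patch 16 (24x24 grid):
# new_pos = interpolate_pos_embed(old_pos, old_grid=14, new_grid=24)
```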

Do you know how to get started with adding a model? Most info can be found here and here.

AnugunjNaman commented 3 years ago

@NielsRogge yeah, I've gone through it. I can follow the same approach as ViT and BEiT. I'll start now; if I get stuck, I'll get back to you.

AnugunjNaman commented 2 years ago

This issue was resolved by PR #17299.
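For anyone landing here later, a minimal usage sketch assuming the merged model follows the library's standard image-classification API; the `microsoft/cvt-13` checkpoint name is an assumption about the published weights.

```python
from PIL import Image
import requests
import torch
from transformers import AutoImageProcessor, CvtForImageClassification

# A sample COCO validation image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the top logit back to a human-readable ImageNet label.
print(model.config.id2label[logits.argmax(-1).item()])
```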