English | 简体中文
A curated list and survey of awesome Vision Transformers.
Open the mind-map source file with any mind-mapping software, or download the high-resolution mind-map images if you just want to browse them.
Only typical algorithms are listed in each category.
Chinese Blogs
Image to Token:
Non-overlapping Patch Embedding
Overlapping Patch Embedding
[T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
[ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
[PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
[ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
[PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
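To illustrate the Image-to-Token split above: a minimal NumPy sketch (not taken from any listed paper; the function name and sizes are illustrative, and real models add a learned linear projection) of how non-overlapping patch embedding (ViT-style) differs from overlapping patch embedding (T2T-ViT / PVTv2-style) only in the stride used when cutting patches.

```python
import numpy as np

def patch_embed(img, patch=4, stride=4):
    """Cut `patch` x `patch` windows every `stride` pixels and flatten each
    into a token. stride == patch -> non-overlapping patches;
    stride < patch -> overlapping patches (neighbouring tokens share pixels)."""
    C, H, W = img.shape
    tokens = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            tokens.append(img[:, y:y + patch, x:x + patch].ravel())
    return np.stack(tokens)  # (num_tokens, C * patch * patch)

img = np.random.rand(3, 32, 32)
print(patch_embed(img, patch=4, stride=4).shape)  # non-overlapping: (64, 48)
print(patch_embed(img, patch=7, stride=4).shape)  # overlapping: (49, 147)
```

Overlapping embedding yields more tokens with shared borders, which is one way these papers inject local inductive bias at the tokenization stage.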
Token to Token:
Explicit position encoding:
Implicit position encoding:
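The explicit/implicit distinction above can be sketched in a few lines of NumPy (illustrative only; an explicit table is learnable in practice, and CPVT's implicit encoding uses a learned depthwise convolution rather than the fixed 3x3 averaging used here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, side = 64, 16, 8  # an 8x8 grid of 16-d tokens

# Explicit: add a position table tied to one fixed sequence length.
pos_table = rng.standard_normal((N, D))  # learnable in practice
def explicit_pe(tokens):
    return tokens + pos_table

# Implicit (CPVT-style idea): derive positions from the tokens themselves
# with a local operator over the 2D grid (here a fixed 3x3 average standing
# in for a depthwise conv), so it generalises to other input resolutions.
def implicit_pe(tokens):
    grid = tokens.reshape(side, side, D)
    pad = np.pad(grid, ((1, 1), (1, 1), (0, 0)))
    local = sum(pad[dy:dy + side, dx:dx + side]
                for dy in range(3) for dx in range(3)) / 9.0
    return tokens + local.reshape(N, D)

tokens = rng.standard_normal((N, D))
print(explicit_pe(tokens).shape, implicit_pe(tokens).shape)  # (64, 16) (64, 16)
```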
Include only global attention:
Multi-Head attention module
Reduce global attention computation
[PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
[PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
[Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
[P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]
[ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
[MViT] Multiscale Vision Transformers (2021.4) [Paper]
[Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
Generalized linear attention
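A minimal NumPy sketch of the "reduce global attention computation" idea (illustrative, single-head, no learned projections; PVT/ResT/P2T implement the reduction with learned strided convolutions or pooling pyramids rather than the plain average pooling used here): keys and values are computed on a spatially reduced token map, so attention cost drops from N² to N·(N/r²).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sr_attention(x, side, r=2):
    """Queries attend over a pooled (side/r x side/r) key/value map
    instead of the full (side x side) token grid."""
    N, D = x.shape
    grid = x.reshape(side, side, D)
    pooled = grid.reshape(side // r, r, side // r, r, D).mean(axis=(1, 3))
    kv = pooled.reshape(-1, D)                # N / r**2 tokens
    attn = softmax(x @ kv.T / np.sqrt(D))     # (N, N/r^2) instead of (N, N)
    return attn @ kv

x = np.random.rand(64, 16)                    # 8x8 grid of 16-d tokens
print(sr_attention(x, side=8, r=2).shape)     # (64, 16)
```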
Introduce extra local attention:
Local window mode
Introduce convolutional local inductive bias
Sparse attention
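The "local window mode" above can be sketched as follows (an illustrative single-head NumPy version; Swin additionally shifts the windows between layers and uses learned Q/K/V projections, both omitted here): the token grid is partitioned into non-overlapping windows and self-attention runs inside each window only, so cost grows linearly with the number of windows.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(x, side, win=4):
    """Partition a (side x side) token grid into win x win windows and
    apply self-attention independently within each window."""
    N, D = x.shape
    grid = x.reshape(side, side, D)
    out = np.empty_like(grid)
    for y in range(0, side, win):
        for xw in range(0, side, win):
            w = grid[y:y + win, xw:xw + win].reshape(-1, D)  # win*win tokens
            a = softmax(w @ w.T / np.sqrt(D))
            out[y:y + win, xw:xw + win] = (a @ w).reshape(win, win, D)
    return out.reshape(N, D)

x = np.random.rand(64, 16)
print(window_attention(x, side=8, win=4).shape)  # (64, 16)
```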
Improve performance with Conv's local information extraction capability:
Pre Normalization
Post Normalization
Class Tokens
Average Pooling
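The four design choices above (Pre- vs Post-Normalization, class token vs average pooling) fit in a short NumPy sketch (illustrative only; `sublayer` and `readout` are my own names, LayerNorm scale/shift parameters and the attention/MLP sub-layers themselves are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm without the learnable scale/shift, for brevity."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x, f, pre_norm=True):
    """Pre-Norm: x + f(LN(x)) -- common in ViTs, trains stably when deep.
    Post-Norm: LN(x + f(x)) -- the original Transformer ordering."""
    return x + f(layer_norm(x)) if pre_norm else layer_norm(x + f(x))

def readout(tokens, use_cls=True):
    """Classification input: a dedicated class token (tokens[0])
    vs. average pooling over all patch tokens."""
    return tokens[0] if use_cls else tokens.mean(axis=0)

t = np.random.rand(65, 16)  # 1 class token + 64 patch tokens
print(sublayer(t, np.tanh).shape,            # (65, 16)
      readout(t).shape,                      # (16,)
      readout(t, use_cls=False).shape)       # (16,)
```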
(1) How to output multi-scale feature map
Patch merging
Pooling attention
Dilated convolution
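Of the three multi-scale options listed, patch merging is the easiest to sketch (an illustrative NumPy version of the Swin-style idea; the learned linear projection from 4C to 2C channels that normally follows is omitted):

```python
import numpy as np

def patch_merge(grid):
    """Concatenate each 2x2 neighbourhood of tokens along the channel axis,
    halving the spatial resolution (H, W -> H/2, W/2) and quadrupling the
    channels (C -> 4C). A linear layer would then project 4C -> 2C."""
    merged = np.concatenate([grid[0::2, 0::2], grid[1::2, 0::2],
                             grid[0::2, 1::2], grid[1::2, 1::2]], axis=-1)
    return merged  # (H/2, W/2, 4C)

g = np.random.rand(8, 8, 16)
print(patch_merge(g).shape)  # (4, 4, 64)
```

Stacking such stages gives the pyramid of feature maps that dense-prediction backbones like Swin and PVT expose.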
(2) How to train a deeper Transformer
[MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]
[ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]
[gMLP] Pay Attention to MLPs (2021.5) [Paper]
[CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]
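One answer to "(2) How to train a deeper Transformer" is CaiT's LayerScale, sketched here in NumPy (illustrative; the per-channel factors are learnable parameters in the paper, fixed constants here):

```python
import numpy as np

def layerscale_residual(x, f, init=1e-4):
    """CaiT-style LayerScale idea: multiply each residual branch by small
    per-channel factors so a very deep stack starts near the identity
    and gradients flow through the residual path from the start."""
    gamma = np.full(x.shape[-1], init)  # learnable in practice
    return x + gamma * f(x)

x = np.random.rand(64, 16)
y = layerscale_residual(x, np.tanh)
print(np.abs(y - x).max() < 1e-3)  # branch contribution starts tiny: True
```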
[T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet (2021.1) [Paper]
[PVT] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions (2021.2) [Paper]
[CPVT] Conditional Positional Encodings for Vision Transformers (2021.2) [Paper]
[TNT] Transformer in Transformer (NeurIPS 2021-2021.3) [Paper]
[CaiT] Going deeper with Image Transformers (2021.3) [Paper]
[DeepViT] DeepViT: Towards Deeper Vision Transformer (2021.3) [Paper]
[Swin Transformer] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV2021-2021.3) [Paper]
[CeiT] Incorporating Convolution Designs into Visual Transformers (2021.3) [Paper]
[LocalViT] LocalViT: Bringing Locality to Vision Transformers (2021.4) [Paper]
[MViT] Multiscale Vision Transformers (2021.4) [Paper]
[Twins] Twins: Revisiting the Design of Spatial Attention in Vision Transformers (2021.4) [Paper]
[Token Labeling] All Tokens Matter: Token Labeling for Training Better Vision Transformers (2021.4) [Paper]
[ResT] ResT: An Efficient Transformer for Visual Recognition (2021.5) [Paper]
[MLP-Mixer] MLP-Mixer: An all-MLP Architecture for Vision (2021.5) [Paper]
[ResMLP] ResMLP: Feedforward networks for image classification with data-efficient training (CVPR2021-2021.5) [Paper]
[gMLP] Pay Attention to MLPs (2021.5) [Paper]
[MSG-Transformer] MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens (2021.5) [Paper]
[PVTv2] PVTv2: Improved Baselines with Pyramid Vision Transformer (2021.6) [Paper]
[TokenLearner] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? (2021.6) [Paper]
Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight (2021.6) [Paper]
[P2T] P2T: Pyramid Pooling Transformer for Scene Understanding (2021.6) [Paper]
[GG-Transformer] Glance-and-Gaze Vision Transformer (2021.6) [Paper]
[Shuffle Transformer] Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer (2021.6) [Paper]
[ViTAE] ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (2021.6) [Paper]
[CycleMLP] CycleMLP: A MLP-like Architecture for Dense Prediction (2021.7) [Paper]
[CSWin Transformer] CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows (2021.7) [Paper]
[PS-ViT] Vision Transformer with Progressive Sampling (2021.8) [Paper]
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (2021.8) [Paper]
[Swin Transformer V2] Swin Transformer V2: Scaling Up Capacity and Resolution (2021.11) [Paper]
[MetaFormer] MetaFormer is Actually What You Need for Vision (2021.11) [Paper]
[Improved MViT] Improved Multiscale Vision Transformers for Classification and Detection (2021.12) [Paper]
[ELSA] ELSA: Enhanced Local Self-Attention for Vision Transformer (2021.12) [Paper]
[ConvMixer] Patches Are All You Need [Paper]
Stay tuned, and PRs are welcome!