2021 [ICLR] (ViT) An image is worth 16x16 words: transformers for image recognition at scale

Introduction

The paper shows that a pure transformer applied directly to sequences of image patches can perform very well on image classification. When pre-trained on large amounts of data, ViT attains results comparable to state-of-the-art CNNs while requiring substantially fewer computational resources to train.

Method

Non-overlapping patches: ViT divides the image into non-overlapping patches, flattens each patch, and uses a linear embedding layer to map each flattened patch to a fixed-size 1D latent vector. The resulting sequence of vectors is the input to the transformer.
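Positional encoding: Unlike a CNN, a transformer has no built-in notion of where each patch sits in the image. To bridge this gap, ViT adds a learnable 1D position embedding to each patch embedding. The paper's ablation finds that any position embedding is much better than none, while 2D-aware and relative variants give no significant gain over the 1D one. A follow-up analysis shows that the learned embeddings do encode position: embeddings of nearby patches, and of patches in the same row or column, are more similar. (Having a positional encoding is crucial for the transformer structure; its exact form matters less here.)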
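Class embedding: Similar to BERT's [class] token, ViT prepends a learnable embedding to the sequence of embedded patches (z_0^0 = x_class). Its state at the output of the transformer encoder serves as the image representation used for classification.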
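
The three steps above can be sketched in a few lines of NumPy. This is only an illustration of how the encoder input is assembled, not the authors' implementation: the function names `patchify` and `embed` are made up for this note, the sizes (224x224 RGB input, 16x16 patches, 768-dimensional latents) are assumed to match a ViT-Base/16 style setup, and the projection, class token, and position embeddings are random stand-ins for parameters that would normally be learned.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (N, patch_size * patch_size * C),
    where N = (H // patch_size) * (W // patch_size).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed(image, patch_size, proj, cls_token, pos_embed):
    """Build the encoder input z_0 (hypothetical helper for this note)."""
    tokens = patchify(image, patch_size) @ proj           # (N, D) patch embeddings
    tokens = np.concatenate([cls_token, tokens], axis=0)  # prepend class token: (N + 1, D)
    return tokens + pos_embed                             # add 1D position embeddings

rng = np.random.default_rng(0)
H = W = 224   # assumed input resolution
C = 3         # RGB channels
P = 16        # patch size
D = 768       # assumed latent size
N = (H // P) * (W // P)

# Random stand-ins for parameters that would be learned during training.
proj = rng.normal(scale=0.02, size=(P * P * C, D))
cls_token = np.zeros((1, D))
pos_embed = rng.normal(scale=0.02, size=(N + 1, D))

image = rng.random((H, W, C))
z0 = embed(image, P, proj, cls_token, pos_embed)
print(N, z0.shape)  # 196 (197, 768)
```

With these sizes the image becomes 14 x 14 = 196 patch tokens, and the class token brings the sequence length to 197. The position embedding is a single table with one row per sequence position, which is what "1D" means here: the 2D grid layout is not given to the model and has to be learned.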

Highlight

A pioneering study showing that a pure transformer can match the performance of state-of-the-art (SOTA) CNNs in the image domain. The CNN baselines it is compared against:
BiT: Big Transfer (BiT)[^1], which performs supervised transfer learning with large ResNets.
Noisy Student: Noisy Student[^2], which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed.

[^1]: Big Transfer (BiT): General visual representation learning.
[^2]: Self-training with Noisy Student improves ImageNet classification.

Limitation

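ViT performs worse than comparable CNNs when trained on small datasets, because it lacks the inductive biases (such as locality and translation equivariance) that are built into CNNs.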