pomelyu / paper-reading-notes

2021 [ICLR] (ViT) An image is worth 16x16 words: transformers for image recognition at scale #17

Open pomelyu opened 8 months ago

Introduction

The paper shows that a pure transformer applied directly to sequences of image patches can perform very well on image classification. When pre-trained on large datasets, ViT attains results comparable to state-of-the-art CNNs while requiring substantially fewer computational resources to train.
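To make "sequences of image patches" concrete, here is a minimal sketch of the patch-splitting step only, in plain Python with nested lists. The function name `patchify` and the zero-filled toy image are illustrative assumptions, not code from the paper; the actual model additionally projects each flattened patch linearly and adds position embeddings before the transformer encoder.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of N = (H / P) * (W / P) patches, each a flat list of
    P * P * C values, in row-major patch order.
    `patchify` is a hypothetical helper written for illustration.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)  # append all channels of this pixel
            patches.append(patch)
    return patches


# Toy 224 x 224 RGB image filled with zeros (assumed input size for illustration).
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=16)
print(len(patches), len(patches[0]))  # prints "196 768"
```

With 16x16 patches, a 224x224 RGB image becomes a sequence of 14 x 14 = 196 tokens, each of dimension 16 x 16 x 3 = 768, which is the sense in which an image is "worth 16x16 words".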

Method

image

Highlight

[^1]: A large-scale study of representation learning with the Visual Task Adaptation Benchmark.
[^2]: Self-training with Noisy Student improves ImageNet classification.

Limitation