2021 SimMIM: a Simple Framework for Masked Image Modeling

Introduction

This paper aims to build a BERT-style model for vision tasks by predicting the masked regions of an image from its visible parts. The authors use a vision transformer (Swin Transformer) and conduct a comprehensive investigation of different training strategies, including the mask region size, prediction head, prediction target, and loss function.

They found that using a moderate mask size, predicting raw RGB values, and using a single linear layer as the prediction head yields a strong pretrained model.
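
A minimal sketch of the masking and loss side of this recipe, in plain Python rather than a deep-learning framework. The function names (`random_patch_mask`, `masked_l1_loss`), the toy image size, and the 0.5 mask ratio are illustrative assumptions, not taken from the paper's code. L1 is used here as one possible choice of loss function, and the model itself (Swin encoder plus linear head) is omitted: `pred` simply stands in for its RGB output.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Return a grid_h x grid_w grid of 0/1 flags; 1 marks a masked patch."""
    n = grid_h * grid_w
    n_mask = int(round(n * mask_ratio))
    masked = set(rng.sample(range(n), n_mask))
    return [[1 if r * grid_w + c in masked else 0 for c in range(grid_w)]
            for r in range(grid_h)]


def masked_l1_loss(pred, target, patch_mask, patch_size):
    """Mean absolute error over the pixels that lie in masked patches only.

    pred, target: H x W x 3 nested lists of floats (RGB values).
    """
    total, count = 0.0, 0
    for y, row in enumerate(target):
        for x, pixel in enumerate(row):
            # Look up the patch this pixel belongs to; skip visible patches.
            if patch_mask[y // patch_size][x // patch_size]:
                for ch in range(len(pixel)):
                    total += abs(pred[y][x][ch] - pixel[ch])
                    count += 1
    return total / count if count else 0.0


# Toy usage: an 8x8 RGB image split into 4x4 patches (a 2x2 patch grid).
rng = random.Random(0)
mask = random_patch_mask(2, 2, mask_ratio=0.5, rng=rng)
target = [[[0.5, 0.5, 0.5] for _ in range(8)] for _ in range(8)]
pred = [[[0.0, 0.0, 0.0] for _ in range(8)] for _ in range(8)]  # stand-in for model output
loss = masked_l1_loss(pred, target, mask, patch_size=4)
```

Only masked patches contribute to the loss, so the visible pixels that the model can simply copy do not affect the training signal.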

Method

Highlight

Limitation

Comments

Contrastive learning: learning a meaningful representation of data from the similarity and dissimilarity between samples.
Images differ from text in their strong local correlation, which results in a comparatively large masking ratio.
Unlike the common practice in contrastive learning, the authors found that a single linear layer outperforms an MLP as the prediction head.
