pomelyu / paper-reading-notes


2021 SimMIM: a Simple Framework for Masked Image Modeling #20

pomelyu opened 7 months ago


Introduction


This paper aims to build a BERT-like model for vision tasks by predicting the masked regions of an image from its visible parts. The authors use a vision transformer (Swin Transformer) and conduct a comprehensive investigation of different training strategies, including the mask region size, prediction head, prediction target, and loss function.
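A minimal NumPy sketch of random block masking, to make "mask region size" concrete: the mask is sampled on a coarse grid of square blocks and then upsampled to the transformer's token grid, so every token inside a chosen block is hidden. The defaults (192×192 input, 32×32 mask blocks, 4×4 model patches, 60% mask ratio) are my assumptions about typical SimMIM settings for Swin, not values from these notes, and `random_block_mask` is a hypothetical name rather than the official code.

```python
import numpy as np

def random_block_mask(img_size=192, mask_patch_size=32, model_patch_size=4,
                      mask_ratio=0.6, rng=None):
    """Return a binary mask over the model's token grid (1 = masked).

    Assumed defaults, not taken from these notes. Masking is decided per
    mask_patch_size block, then upsampled so all tokens in a block share it.
    """
    rng = np.random.default_rng() if rng is None else rng
    grid = img_size // mask_patch_size            # blocks per image side
    scale = mask_patch_size // model_patch_size   # tokens per block side
    n_blocks = grid * grid
    n_masked = int(np.ceil(n_blocks * mask_ratio))
    flat = np.zeros(n_blocks, dtype=np.int64)
    flat[rng.permutation(n_blocks)[:n_masked]] = 1  # pick blocks at random
    block_mask = flat.reshape(grid, grid)
    # Upsample from block resolution to token resolution.
    return block_mask.repeat(scale, axis=0).repeat(scale, axis=1)

mask = random_block_mask(rng=np.random.default_rng(0))
# mask has shape (48, 48); 22 of the 36 blocks (8x8 tokens each) are masked
```

Larger blocks make the task harder, because a masked pixel is then usually far from any visible pixel and cannot be filled in by simple local interpolation.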

They found that a moderate mask region size, raw RGB values as the prediction target, and a single linear layer as the prediction head are enough to produce a powerful pretrained model.
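A sketch of the linear head and RGB regression target under stated assumptions: one linear layer maps each encoder feature to a `stride × stride` patch of RGB values, and the loss is the mean absolute (ℓ1) error computed only over masked pixels. The ℓ1 loss and masked-only averaging reflect my reading of the paper rather than anything stated in these notes; `linear_pixel_head` and `masked_l1_loss` are hypothetical names, and the random features below stand in for real Swin encoder output.

```python
import numpy as np

def linear_pixel_head(features, weight, bias, stride):
    """Map features (h, w, C) to RGB pixels (h*stride, w*stride, 3).

    One linear layer predicts stride*stride*3 values per feature position,
    which are rearranged into a stride x stride RGB patch (hypothetical helper).
    """
    h, w, _ = features.shape
    out = features @ weight + bias                  # (h, w, stride*stride*3)
    out = out.reshape(h, w, stride, stride, 3)
    # Interleave patches: pixel row = i*stride + a, col = j*stride + b.
    return out.transpose(0, 2, 1, 3, 4).reshape(h * stride, w * stride, 3)

def masked_l1_loss(pred, target, pixel_mask):
    """Mean absolute error over masked pixels only (pixel_mask: 1 = masked).

    Assumed loss form; unmasked pixels contribute nothing.
    """
    m = pixel_mask[..., None].astype(pred.dtype)    # broadcast over RGB
    return float((np.abs(pred - target) * m).sum() / (m.sum() * 3))

rng = np.random.default_rng(0)
stride, C = 4, 8
feats = rng.normal(size=(6, 6, C))                  # stand-in encoder output
W = rng.normal(size=(C, stride * stride * 3)) * 0.1
b = np.zeros(stride * stride * 3)
pred = linear_pixel_head(feats, W, b, stride)       # shape (24, 24, 3)
target = rng.uniform(size=pred.shape)               # stand-in RGB image
pixel_mask = np.zeros((24, 24), dtype=np.int64)
pixel_mask[:12] = 1                                 # mask the top half
loss = masked_l1_loss(pred, target, pixel_mask)
```

Because the loss ignores unmasked pixels, the encoder gains nothing from copying visible content and has to infer the hidden regions from context.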

Method

Highlight

Limitation

Comments