Closed: guarin closed this issue 1 year ago
In the appendix, section E of the SimMIM paper, they test with a ResNet-50, which also shows performance above that of other self-supervised approaches with convolutional neural networks (BYOL etc.).
Will such a configuration also be possible with Lightly? Convnets have the advantage of being very well understood, offering good compute performance, and being easy to deploy, including to accelerator chips.
Great find! I completely missed this part of the paper.
It looks relatively simple to implement, but it is not part of the official code, which makes it a bit hard to assess whether there are any pitfalls. We could give it a try; it could be even easier to implement than the transformer version. We would just have to implement the masking for resnets and slightly adapt the forward pass.
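For reference, the masking step for a convolutional backbone could look roughly like the sketch below, which zeroes out random patches of the input image before the ResNet forward pass. The function name `mask_image`, the patch size, and the mask ratio are illustrative assumptions, not the paper's exact recipe:

```python
# Hedged sketch: random patch-wise masking of the input image, one plausible
# way to feed a masked image to a convolutional backbone. Values are toy defaults.
import torch


def mask_image(images, patch_size=32, mask_ratio=0.6, generator=None):
    # images: (B, C, H, W); H and W are assumed divisible by patch_size.
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Randomly select which patches to mask (True = masked).
    mask = torch.rand(b, gh, gw, generator=generator) < mask_ratio
    # Upsample the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    # Zero out the masked pixels; the result can be fed to any convnet.
    masked = images * (~pixel_mask).unsqueeze(1)
    return masked, mask
```

The patch-level mask is returned alongside the masked image so a reconstruction loss could later be restricted to the masked regions.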
Looks like masked autoencoders now went in (#799), which is maybe a good starting point for SimMIM?
Yes, this should be relatively easy to implement now if I remember our discussions correctly @guarin
Yes, I think we can just combine the MAEBackbone, a linear prediction head, and the L1 loss in a module to build the SimMIM model. I would propose adding this as a new module in the imagenette benchmark file. Then we can test it and also see whether we want to do some refactoring or add building blocks.
Adding support for resnets would involve some more work, as we would have to write a new backbone that supports masking.
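As a rough sketch of how those pieces might fit together, assuming a transformer backbone: all names below are hypothetical, and a plain `nn.TransformerEncoder` stands in for Lightly's actual MAEBackbone.

```python
# Hedged sketch of a SimMIM-style module: encoder sees masked AND visible
# tokens, a single linear layer decodes, and an L1 loss is computed on the
# masked patches only. Dimensions are toy values for illustration.
import torch
import torch.nn as nn


class SimMIMSketch(nn.Module):
    def __init__(self, patch_dim=48, embed_dim=64):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        # Learned token that replaces the embedding of each masked patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # SimMIM decoder: a single linear layer back to pixel space.
        self.head = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) boolean, True = masked.
        x = self.patch_embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        # Unlike MAE, both masked and non-masked tokens pass through the encoder.
        x = self.encoder(x)
        pred = self.head(x)
        # L1 loss restricted to the masked patches.
        return (pred - patches).abs()[mask].mean()
```

This keeps the differences to MAE small: swap the transformer decoder for one linear layer, keep masked tokens in the encoder input, and switch the reconstruction loss from L2 to L1.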
Closed by: #1003
SimMIM: A Simple Framework for Masked Image Modeling
18.11.2021 https://arxiv.org/abs/2111.09886 https://github.com/microsoft/SimMIM
Similar architecture to MAE, but it uses only a single linear layer as the decoder instead of a transformer, passes both masked and non-masked tokens to the encoder, and uses an L1 instead of an L2 loss.
Estimated effort to implement in Lightly: Low, once MAE is implemented