Closed: guarin closed this issue 1 year ago
In the appendix, section E of the SimMIM paper, they test with a ResNet-50, which also shows performance above that of other self-supervised approaches with convolutional neural networks (BYOL etc.).
Will such a configuration also be possible with Lightly? Convnets have the advantage of being very well understood, offering good compute performance, and being easy to deploy, including to accelerator chips.
Great find! I completely missed this part of the paper.
It looks relatively simple to implement, but it is not part of the official code, which makes it a bit hard to assess whether there are any pitfalls. We could give it a try; it could be even easier to implement than the transformer version. We would just have to implement the masking for resnets and slightly adapt the forward pass.
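For reference, the masking step for a convolutional backbone could look roughly like the sketch below, which zeroes out random patches of the input image before the ResNet forward pass. The function name `mask_image`, the patch size, and the mask ratio are illustrative assumptions, not the paper's exact recipe:

```python
# Hedged sketch: random patch-wise masking of the input image, one plausible
# way to feed a masked image to a convolutional backbone. Values are toy defaults.
import torch


def mask_image(images, patch_size=32, mask_ratio=0.6, generator=None):
    # images: (B, C, H, W); H and W are assumed divisible by patch_size.
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Randomly select which patches to mask (True = masked).
    mask = torch.rand(b, gh, gw, generator=generator) < mask_ratio
    # Upsample the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    # Zero out the masked pixels; the result can be fed to any convnet.
    masked = images * (~pixel_mask).unsqueeze(1)
    return masked, mask
```

The patch-level mask is returned alongside the masked image so a reconstruction loss could later be restricted to the masked regions.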
Looks like masked autoencoders now went in (#799), which is maybe a good starting point for SimMIM?
Yes, this should be relatively easy to implement now if I remember our discussions correctly @guarin
Yes, I think we can just combine the MAEBackbone, a linear prediction head, and the L1 loss in a module to build the SimMIM model. I would propose adding this as a new module in the imagenette benchmark file. Then we can test it and also see whether we want to do some refactoring or add building blocks.
Adding support for resnets would involve some more work, as we would have to write a new backbone that supports masking.
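As a rough sketch of how those pieces might fit together, assuming a transformer backbone: all names below are hypothetical, and a plain `nn.TransformerEncoder` stands in for Lightly's actual MAEBackbone.

```python
# Hedged sketch of a SimMIM-style module: encoder sees masked AND visible
# tokens, a single linear layer decodes, and an L1 loss is computed on the
# masked patches only. Dimensions are toy values for illustration.
import torch
import torch.nn as nn


class SimMIMSketch(nn.Module):
    def __init__(self, patch_dim=48, embed_dim=64):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        # Learned token that replaces the embedding of each masked patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # SimMIM decoder: a single linear layer back to pixel space.
        self.head = nn.Linear(embed_dim, patch_dim)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) boolean, True = masked.
        x = self.patch_embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        # Unlike MAE, both masked and non-masked tokens pass through the encoder.
        x = self.encoder(x)
        pred = self.head(x)
        # L1 loss restricted to the masked patches.
        return (pred - patches).abs()[mask].mean()
```

This keeps the differences to MAE small: swap the transformer decoder for one linear layer, keep masked tokens in the encoder input, and switch the reconstruction loss from L2 to L1.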
Closed by: #1003
SimMIM: A Simple Framework for Masked Image Modeling
18.11.2021 https://arxiv.org/abs/2111.09886 https://github.com/microsoft/SimMIM
Similar architecture to MAE, but it uses only a single linear layer as the decoder instead of a transformer, passes both masked and non-masked tokens to the encoder, and uses an L1 instead of an L2 loss.
Estimated effort to implement in Lightly: Low, once MAE is implemented