lightly-ai / lightly

A python library for self-supervised learning on images.
https://docs.lightly.ai/self-supervised-learning/
MIT License

SimMIM and non-ViT backbones #1135

Open faris-k opened 1 year ago

faris-k commented 1 year ago

I'm noticing very subpar performance from SimMIM on my task compared to MAE, and this also seems to be an issue on the Imagenette benchmarks. I was wondering what might be causing this, and whether we'd still see performance issues with a non-ViT backbone. Is it possible to use backbones like convnets and Swin transformers with the current implementation of SimMIM? I'm curious how you'd need to change the forward_encoder method to do so, and whether images_to_tokens could be generalized to other backbones.

def forward_encoder(self, images, batch_size, idx_mask):
    # pass all the tokens to the encoder, both masked and non masked ones
    tokens = self.backbone.images_to_tokens(images, prepend_class_token=True)
    tokens_masked = utils.mask_at_index(tokens, idx_mask, self.mask_token)
    return self.backbone.encoder(tokens_masked)
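
For context, one idea would be to skip images_to_tokens entirely and apply the mask in pixel space before the backbone, so any feature extractor (ConvNet, Swin, ...) could consume the masked image. This is only a rough, hypothetical sketch: mask_image_patches and the patch_size value are made up for illustration, and it assumes idx_mask holds plain patch indices (no class-token offset).

import torch

def mask_image_patches(images, idx_mask, patch_size, mask_value=0.0):
    # Hypothetical helper (not part of lightly): replace the masked patches
    # directly in pixel space so a non-ViT backbone can consume the result
    # without an images_to_tokens step.
    _, _, _, width = images.shape
    patches_per_row = width // patch_size
    masked = images.clone()
    for b, indices in enumerate(idx_mask):
        for idx in indices:
            row = (int(idx) // patches_per_row) * patch_size
            col = (int(idx) % patches_per_row) * patch_size
            masked[b, :, row : row + patch_size, col : col + patch_size] = mask_value
    return masked

def forward_encoder(self, images, batch_size, idx_mask):
    # Mask in pixel space instead of token space, then run the whole
    # backbone (e.g. a Swin or ConvNet feature extractor) on the result.
    images_masked = mask_image_patches(images, idx_mask, patch_size=32)
    return self.backbone(images_masked)

(The original SimMIM uses a learnable mask token rather than zeros, so this is at best an approximation.)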
guarin commented 1 year ago

Hi @faris-k! We noticed this as well, but I haven't looked into it yet. From a quick glance at the code, I believe the linear decoder head might be missing. The reference implementation is here: https://github.com/microsoft/SimMIM/blob/d3e29bcac950b83edc34ca33fe4404f38309052c/models/simmim.py#L104

But I guess a simple linear layer might be enough in our setup.
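
Something along these lines could work. Just a sketch, not our benchmark code; the embed_dim and patch_size defaults are placeholders:

import torch
import torch.nn as nn

class LinearReconstructionHead(nn.Module):
    # Hypothetical sketch of a SimMIM-style reconstruction head: a single
    # linear layer mapping each encoded token back to the raw pixels of its
    # patch (patch_size * patch_size * 3 values per token).
    def __init__(self, embed_dim: int = 768, patch_size: int = 32):
        super().__init__()
        self.decoder = nn.Linear(embed_dim, patch_size**2 * 3)

    def forward(self, encoded_tokens: torch.Tensor) -> torch.Tensor:
        # encoded_tokens: (batch, num_tokens, embed_dim)
        # returns:        (batch, num_tokens, patch_size * patch_size * 3)
        return self.decoder(encoded_tokens)

The reconstruction loss would then be computed between the predicted pixels and the original patches at the masked positions.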

Also note that we measure performance with KNN, which generally gives lower scores than linear evaluation or finetuning. ViT-based architectures usually require finetuning for good performance.