faris-k opened 1 year ago

I'm noticing very subpar performance from SimMIM on my task compared to MAE, and this also seems to be an issue on the Imagenette benchmarks. I was wondering what might be causing this, and whether we'd still see performance issues with a non-ViT backbone. Is it possible to use backbones like convnets and Swin transformers with the current implementation of SimMIM? I'm curious how you would need to change the `forward_encoder` method to do so, and whether `images_to_tokens` could be generalized to other backbones.
Hi @faris-k! We noticed this as well, but I haven't looked into it yet. From a quick glance at the code, I believe the linear decoder head might be missing. The reference implementation is here: https://github.com/microsoft/SimMIM/blob/d3e29bcac950b83edc34ca33fe4404f38309052c/models/simmim.py#L104

But I suspect a simple linear layer might be enough in our setup.
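For what it's worth, here's a minimal sketch of such a head in plain PyTorch (hypothetical, not our current API; `embed_dim` and `patch_size` would have to match the backbone config):

```python
import torch
from torch import nn


class LinearDecoderHead(nn.Module):
    """Per-token linear projection from encoder features back to raw pixels,
    in the spirit of SimMIM's lightweight one-layer prediction head."""

    def __init__(self, embed_dim: int = 768, patch_size: int = 16, in_chans: int = 3):
        super().__init__()
        # A single linear layer predicts all pixels of the patch each token covers.
        self.decoder = nn.Linear(embed_dim, patch_size**2 * in_chans)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, embed_dim)
        # returns: (batch, num_patches, patch_size**2 * in_chans)
        return self.decoder(tokens)
```

The training target would then be an L1 loss between predicted and original pixels, computed only on the masked patches, as in the reference implementation.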
Also note that we measure performance using a KNN classifier, which typically yields lower scores than linear evaluation or finetuning. ViT-based architectures generally require finetuning for good performance.
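As for convnets and Swin: the reference SimMIM handles Swin by replacing the masked patch embeddings with a learnable mask token, much like the ViT path. For a backbone without a token interface (e.g. a plain convnet), one simple alternative is to mask in pixel space before the encoder, so `forward_encoder` only needs a backbone that maps images to a feature map. A rough sketch under those assumptions (plain PyTorch, hypothetical helper, not the current lightly API):

```python
import torch


def mask_in_pixel_space(
    images: torch.Tensor, mask_ratio: float = 0.6, mask_patch_size: int = 32
) -> tuple[torch.Tensor, torch.Tensor]:
    """Zero out random square patches of the input images.

    Returns the masked images and the binary patch mask (1 = masked),
    so the reconstruction loss can be restricted to masked regions.
    """
    b, _, h, w = images.shape
    gh, gw = h // mask_patch_size, w // mask_patch_size
    # Random binary mask over the coarse patch grid.
    mask = (torch.rand(b, 1, gh, gw, device=images.device) < mask_ratio).float()
    # Upsample the patch mask to pixel resolution.
    pixel_mask = mask.repeat_interleave(mask_patch_size, dim=2)
    pixel_mask = pixel_mask.repeat_interleave(mask_patch_size, dim=3)
    return images * (1.0 - pixel_mask), mask


# Usage with any image-to-feature-map backbone (names are hypothetical):
# masked_images, mask = mask_in_pixel_space(images)
# features = convnet_backbone(masked_images)
# reconstruction = decoder(features)
# loss = L1 between reconstruction and images, restricted to masked regions via `mask`
```

This is just a sketch; pixel-space masking is not exactly what the reference SimMIM does for Swin, but it sidesteps the need for a token-based `images_to_tokens` entirely.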