faris-k opened 1 year ago

I'm noticing very subpar performance from SimMIM on my task compared to MAE, and this also seems to be an issue on the Imagenette benchmarks. I was wondering what might be causing this, and whether we'd still see performance issues with a non-ViT backbone. Is it possible to use backbones like convnets and Swin transformers with the current implementation of SimMIM? I'm curious how you would need to change the `forward_encoder` method to do so, and whether `images_to_tokens` could be generalized to other backbones.
Hi @faris-k! We noticed this as well, but I haven't looked into it yet. From a quick glance at the code, I believe the linear decoder head might be missing. The reference implementation is here: https://github.com/microsoft/SimMIM/blob/d3e29bcac950b83edc34ca33fe4404f38309052c/models/simmim.py#L104

But I suspect a simple linear layer might be enough in our setup.
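For what it's worth, here's a minimal sketch of such a head in plain PyTorch (hypothetical, not our current API; `embed_dim` and `patch_size` would have to match the backbone config):

```python
import torch
from torch import nn


class LinearDecoderHead(nn.Module):
    """Per-token linear projection from encoder features back to raw pixels,
    in the spirit of SimMIM's lightweight one-layer prediction head."""

    def __init__(self, embed_dim: int = 768, patch_size: int = 16, in_chans: int = 3):
        super().__init__()
        # A single linear layer predicts all pixels of the patch each token covers.
        self.decoder = nn.Linear(embed_dim, patch_size**2 * in_chans)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, embed_dim)
        # returns: (batch, num_patches, patch_size**2 * in_chans)
        return self.decoder(tokens)
```

The training target would then be an L1 loss between predicted and original pixels, computed only on the masked patches, as in the reference implementation.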
Also note that we measure performance using a KNN classifier, which typically yields lower scores than linear evaluation or finetuning. ViT-based architectures generally require finetuning for good performance.
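As for convnets and Swin: the reference SimMIM handles Swin by replacing the masked patch embeddings with a learnable mask token, much like the ViT path. For a backbone without a token interface (e.g. a plain convnet), one simple alternative is to mask in pixel space before the encoder, so `forward_encoder` only needs a backbone that maps images to a feature map. A rough sketch under those assumptions (plain PyTorch, hypothetical helper, not the current lightly API):

```python
import torch


def mask_in_pixel_space(
    images: torch.Tensor, mask_ratio: float = 0.6, mask_patch_size: int = 32
) -> tuple[torch.Tensor, torch.Tensor]:
    """Zero out random square patches of the input images.

    Returns the masked images and the binary patch mask (1 = masked),
    so the reconstruction loss can be restricted to masked regions.
    """
    b, _, h, w = images.shape
    gh, gw = h // mask_patch_size, w // mask_patch_size
    # Random binary mask over the coarse patch grid.
    mask = (torch.rand(b, 1, gh, gw, device=images.device) < mask_ratio).float()
    # Upsample the patch mask to pixel resolution.
    pixel_mask = mask.repeat_interleave(mask_patch_size, dim=2)
    pixel_mask = pixel_mask.repeat_interleave(mask_patch_size, dim=3)
    return images * (1.0 - pixel_mask), mask


# Usage with any image-to-feature-map backbone (names are hypothetical):
# masked_images, mask = mask_in_pixel_space(images)
# features = convnet_backbone(masked_images)
# reconstruction = decoder(features)
# loss = L1 between reconstruction and images, restricted to masked regions via `mask`
```

This is just a sketch; pixel-space masking is not exactly what the reference SimMIM does for Swin, but it sidesteps the need for a token-based `images_to_tokens` entirely.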