rstrudel / segmenter

[ICCV2021] Official PyTorch implementation of Segmenter: Transformer for Semantic Segmentation

How much does ImageNet pre-training affect model performance? #53

Closed · fbragman closed this 2 years ago

fbragman commented 2 years ago

Hi,

I am trying to use the Linear-decoder model described in the paper as a baseline for some of my work. However, I do not have access to pre-trained ImageNet weights, and my model is not able to learn, converging at around 0.25 mDICE on the Cityscapes training set. This is after hyperparameter optimisation across SGD and Adam with different learning-rate schedulers.

I was wondering whether, in your experiments, you saw similar levels of performance when you did not initialise the transformer backbone with pre-trained weights? Was this tested for both the baseline (ViT + Linear) and your proposed method (ViT + Mask)?
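To make the setting concrete, by "no pre-trained weights" I mean the difference between the following two backbone initialisations (a minimal sketch using `timm`; my actual setup differs in the details):

```python
import timm

# ViT-Base/16 backbone initialised from ImageNet-pretrained weights
vit_pretrained = timm.create_model("vit_base_patch16_384", pretrained=True)

# Identical architecture with random initialisation (the from-scratch setting)
vit_scratch = timm.create_model("vit_base_patch16_384", pretrained=False)
```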

Thank you

rstrudel commented 2 years ago

Hi @fbragman,

Thanks for your question. We did check performance when training from scratch on ADE20K; you can find the results in the appendix of our paper. The ablation covers ViT + Linear.

Pre-training is key in general for detection and localization tasks such as segmentation. The main reason is that downstream datasets (such as ADE20K, Cityscapes or Pascal) are simply too small compared to classification datasets such as ImageNet or ImageNet-21k. More data is needed to reach good performance with current deep learning models, and this holds for both CNN and Transformer backbones.
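For reference, the ViT + Linear baseline amounts to roughly the following (a simplified sketch, not the exact code in this repository; it assumes a recent `timm` where `forward_features` returns the full token sequence, and it skips details such as positional-embedding resizing):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm

class ViTLinear(nn.Module):
    """ViT backbone + pointwise linear decoder (simplified sketch)."""

    def __init__(self, n_classes: int, pretrained: bool = True):
        super().__init__()
        # pretrained=True loads ImageNet weights through timm;
        # pretrained=False is the from-scratch setting discussed above.
        self.backbone = timm.create_model(
            "vit_base_patch16_384", pretrained=pretrained, num_classes=0
        )
        self.patch_size = 16
        self.head = nn.Linear(self.backbone.embed_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, _, H, W = x.shape  # expects 384x384 inputs unless pos. embeddings are resized
        tokens = self.backbone.forward_features(x)  # (B, 1 + N, D) in recent timm
        tokens = tokens[:, 1:]                      # drop the class token
        logits = self.head(tokens)                  # per-patch class scores
        h, w = H // self.patch_size, W // self.patch_size
        logits = logits.transpose(1, 2).reshape(B, -1, h, w)
        # upsample patch-level predictions back to pixel resolution
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)
```

With `ViTLinear(n_classes=19, pretrained=False)` you get the from-scratch Cityscapes setting (19 evaluation classes); flipping `pretrained=True` is the only change between the two regimes being compared.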