Closed: fbragman closed this issue 2 years ago
Hi @fbragman,
Thanks for your question. We did evaluate performance when training from scratch on ADE20K; the results are in the appendix of our paper. The ablation covers the ViT + Linear configuration.
Pre-training is key in general for detection and localization tasks such as segmentation. The main reason is that downstream datasets (ADE20K, Cityscapes, Pascal, etc.) are simply too small compared to classification datasets such as ImageNet or ImageNet-21k. Current deep learning models need far more data to reach good performance, and this holds whether the backbone is a CNN or a Transformer.
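In case a concrete starting point helps, below is a minimal sketch (not the authors' exact code) of a ViT + Linear baseline whose encoder is initialised from ImageNet pre-trained weights via the timm library. The model name `vit_base_patch16_384`, the class `ViTLinearSeg`, and the decoder details are illustrative assumptions, not this repo's API; the `pretrained=True` flag is the step that matters here.

```python
# Minimal sketch (illustrative, not the authors' setup): a ViT encoder with
# ImageNet pre-trained weights plus a linear per-patch classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm


class ViTLinearSeg(nn.Module):
    """ViT encoder + linear per-patch decoder (illustrative baseline)."""

    def __init__(self, num_classes: int, pretrained: bool = True):
        super().__init__()
        # pretrained=True loads ImageNet-trained weights; setting it to False
        # reproduces the from-scratch regime discussed above.
        self.encoder = timm.create_model(
            "vit_base_patch16_384", pretrained=pretrained, num_classes=0
        )
        self.patch_size = 16
        self.head = nn.Linear(self.encoder.embed_dim, num_classes)

    def forward(self, x):
        b, _, h, w = x.shape
        # forward_features returns the token sequence; for this ViT variant
        # the first token is the CLS token (assumed here), which we drop.
        feats = self.encoder.forward_features(x)  # (B, 1 + N, D)
        feats = feats[:, 1:]                      # (B, N, D) patch tokens
        logits = self.head(feats)                 # (B, N, num_classes)
        gh, gw = h // self.patch_size, w // self.patch_size
        logits = logits.transpose(1, 2).reshape(b, -1, gh, gw)
        # Upsample patch-level logits back to input resolution.
        return F.interpolate(
            logits, size=(h, w), mode="bilinear", align_corners=False
        )


model = ViTLinearSeg(num_classes=150, pretrained=True)  # e.g. 150 ADE20K classes
out = model(torch.randn(1, 3, 384, 384))                # -> (1, 150, 384, 384)
```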
Hi,
I am trying to use the baseline model (linear decoder) described in the paper as a baseline for some of my work. However, I do not have access to pre-trained ImageNet weights, and the model fails to learn, converging at around 0.25 mean Dice on the Cityscapes training set. This is after hyperparameter optimisation over SGD, Adam, and several learning-rate schedulers.
I was wondering whether, during your experiments, you saw similar performance when the transformer backbones were not initialised with pre-trained weights. Was this tested for both the baseline (ViT + Linear) and your proposed method (ViT + Mask)?
Thank you