open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io

image sizes in pretrained models for SwinTransformer vs dataset image size #7233

Open Xenosender opened 2 years ago

Xenosender commented 2 years ago

Hi all

I'm trying to retrain an object detector with a SwinTransformer backbone, starting from a pretrained .pth checkpoint.

Everything is working well, but if I understand the mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py config correctly, the backbone checkpoint was pretrained on 224×224 images, while the detection images are resized to something much larger (around size 600 in my case).
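
To make the mismatch concrete, here is a rough, illustrative sketch of where the two sizes live in an MMDetection-style config. The values and the pretrain_img_size argument name are from memory, not copied from the actual mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py file, so treat them as assumptions:

```python
# Illustrative fragment only -- not the real config file contents.
model = dict(
    backbone=dict(
        type='SwinTransformer',
        # The ImageNet checkpoint was trained on 224x224 crops with 7x7 windows.
        pretrain_img_size=224,   # argument name assumed; check your mmdet version
        window_size=7,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='swin_tiny_patch4_window7_224.pth')))

train_pipeline = [
    # Detection training resizes images to a much larger scale than 224.
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),  # typical COCO scale
]
```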

Though the whole process seems consistent with the original paper, I don't really understand how pretraining on images that are so much smaller produces a good pretrained model (though I imagine it is still better than no pretraining at all). Wouldn't the patches carry very different descriptors between size-224 images and size-600 images? Wouldn't it be much more helpful to have a pretrained model trained at an image size much closer to the target size? Or did I miss something that makes this a non-issue?

Thanks

shinya7y commented 2 years ago

High-resolution pre-training is better even for CNNs.

YOLOv2 on VOC 2007 https://arxiv.org/abs/1612.08242

High Resolution Classifier. All state-of-the-art detection methods use classifier pre-trained on ImageNet [16]. Starting with AlexNet most classifiers operate on input images smaller than 256 × 256 [8]. The original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection. This means the network has to simultaneously switch to learning object detection and adjust to the new input resolution. For YOLOv2 we first fine tune the classification network at the full 448 × 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We then fine tune the resulting network on detection. This high resolution classification network gives us an increase of almost 4% mAP.

The improvement of mAP on COCO will probably be smaller than that on VOC 2007.
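
For what it's worth, here is a minimal sketch of that two-phase recipe in PyTorch. A ResNet-50 and random tensors stand in for Darknet-19 and ImageNet just to keep the sketch self-contained; nothing here comes from the YOLOv2 code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet50  # stand-in classifier; YOLOv2 used Darknet-19

# Stand-in for ImageNet at 448x448 (random tensors, only to keep the sketch runnable).
images = torch.randn(8, 3, 448, 448)
labels = torch.randint(0, 1000, (8,))
loader_448 = DataLoader(TensorDataset(images, labels), batch_size=4)

# Phase 1: fine-tune the ImageNet-pretrained classifier at the higher resolution
# before touching detection, so its filters adapt to the new input size.
model = resnet50(weights='IMAGENET1K_V1')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(10):              # YOLOv2 used roughly 10 epochs for this phase
    for x, y in loader_448:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Phase 2 (not shown): reuse `model` as the detector backbone and fine-tune on detection.
```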

Xenosender commented 2 years ago

Thanks, and I agree about CNNs. I was just wondering whether I had missed something on this subject for transformer-based models.

shinya7y commented 2 years ago

Swin Transformer V2 discusses a related issue. https://arxiv.org/abs/2111.09883

Secondly, many downstream vision tasks such as object detection and semantic segmentation require high resolution input images or large attention windows. The window size variations between low-resolution pre-training and high-resolution fine-tuning can be quite large. The current common practice is to perform a bi-cubic interpolation of the position bias maps [15, 35]. This simple fix is somewhat ad-hoc and the result is usually sub-optimal. We introduce a log-spaced continuous position bias (Log-CPB), which generates bias values for arbitrary coordinate ranges by applying a small meta network on the log-spaced coordinate inputs. Since the meta network takes any coordinates, a pre-trained model will be able to freely transfer across window sizes by sharing weights of the meta network. A critical design of our approach is to transform the coordinates into the log-space so that the extrapolation ratio can be low even when the target window size is significantly larger than that of pre-training.

The I512, I640, I768 models with CPB in Table 1 may be good pre-trained models.
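
For intuition, here is a heavily simplified sketch of such a log-spaced continuous position bias (Log-CPB) module. The hidden size and normalization are guesses based on the paper's description, not a copy of the official Swin V2 implementation:

```python
import torch
import torch.nn as nn


class LogCPB(nn.Module):
    """Simplified sketch of Swin V2's log-spaced continuous position bias.

    A small meta network (MLP) maps log-spaced relative (dy, dx) offsets to one
    bias value per attention head, so the same weights can be re-evaluated for
    any window size instead of interpolating a fixed bias table.
    """

    def __init__(self, num_heads: int, hidden_dim: int = 512):
        super().__init__()
        self.meta_mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, num_heads, bias=False))

    def forward(self, window_size: int, pretrained_window_size: int = 7) -> torch.Tensor:
        # All relative offsets inside a window: a (2M-1) x (2M-1) grid of (dy, dx).
        coords = torch.arange(-(window_size - 1), window_size, dtype=torch.float32)
        dy, dx = torch.meshgrid(coords, coords, indexing='ij')
        rel = torch.stack([dy, dx], dim=-1)                    # (2M-1, 2M-1, 2)

        # Normalize by the pre-training window range, then map to log space so
        # that a larger fine-tuning window only extrapolates over a small range.
        rel = rel / (pretrained_window_size - 1)
        rel = torch.sign(rel) * torch.log1p(rel.abs())

        # One bias value per relative offset and head: (2M-1, 2M-1, num_heads).
        return self.meta_mlp(rel)


# The same (pre-trained) MLP weights can produce bias tables for the 7x7
# pre-training windows and for larger fine-tuning windows, e.g. 12x12:
cpb = LogCPB(num_heads=3)
print(cpb(7).shape)    # torch.Size([13, 13, 3])
print(cpb(12).shape)   # torch.Size([23, 23, 3])
```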