microsoft / CSWin-Transformer

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, CVPR 2022
MIT License

About the patches_resolution of the segmentation model #21

Closed danczs closed 2 years ago

danczs commented 2 years ago

Hello, this work is interesting, but I have some questions about the 'patches_resolution' of the segmentation model. I notice that the long side of the cross-shaped windows is set by 'patches_resolution' rather than the actual feature resolution. For example, in stage 3, the long side is 224 / 16 = 14. Do I understand this correctly? Does that make it impossible to exchange information beyond the 'patches_resolution'?

LightDXY commented 2 years ago

Hi, this is just a predefined parameter used to initialize the model, so we set it to the ImageNet setting (224) in order to load the pretrained model with the same size. In practice, the feature resolution is determined by the current input, since its size is dynamic. This ensures that the attention window is always H x sw or W x sw.
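To illustrate the point about dynamic resolution, here is a minimal NumPy sketch of stripe partitioning in which the window's long side is read from the feature map itself, not from a fixed init-time value. It is not the repository's code: the function name `stripe_windows`, the stripe width `sw=2`, the channel count, and the stride of 16 for stage 3 are assumptions chosen for illustration.

```python
import numpy as np

def stripe_windows(feat, sw, vertical=True):
    """Split an (H, W, C) feature map into stripe-shaped windows.

    vertical=True  -> each window is (H, sw): full height, sw columns
    vertical=False -> each window is (sw, W): sw rows, full width
    The long side comes from feat.shape, so it follows the input size.
    """
    H, W, C = feat.shape
    if vertical:
        assert W % sw == 0, "W must be divisible by the stripe width"
        # (H, W//sw, sw, C) -> (W//sw, H, sw, C): one window per column group
        return feat.reshape(H, W // sw, sw, C).transpose(1, 0, 2, 3)
    assert H % sw == 0, "H must be divisible by the stripe width"
    # (H//sw, sw, W, C): one window per row group
    return feat.reshape(H // sw, sw, W, C)

# Hypothetical stage with stride 16 and stripe width sw=2.
for side in (224, 512):
    H = W = side // 16
    feat = np.zeros((H, W, 8))
    v = stripe_windows(feat, sw=2, vertical=True)
    h = stripe_windows(feat, sw=2, vertical=False)
    print(side, v.shape, h.shape)
```

With a 224 input the stripes span 14 positions, and with a 512 input they span 32, so the long side is not capped at the init-time value of 14.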

danczs commented 2 years ago

Thanks for your reply. I have found the corresponding code.