In an object detection context, it would be useful to support arbitrary input sizes by computing the attention mask dynamically at forward time (this is probably possible only with relative position encoding).
These kinds of edits were done in https://github.com/SwinTransformer/Swin-Transformer-Object-Detection and in https://github.com/megvii-research/SOLQ/. It would be nice if you upstreamed these changes, since that would make it much simpler to try EsViT checkpoints as pretraining for object detection.
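
For reference, here is a rough sketch of what I mean by dynamic mask computation. It is framework-free Python written only to show the logic, not the code from either repo: the function name `shifted_window_mask`, its parameters, and the `-100.0` fill value are my own assumptions for illustration. The idea is to pad the feature map size up to a multiple of the window size, label the regions created by the cyclic shift, and mask out pairs of positions in the same window that carry different labels.

```python
import math


def shifted_window_mask(height, width, window_size, shift_size, neg=-100.0):
    """Build shifted-window attention masks for an arbitrary feature map size.

    Hypothetical helper for illustration. Returns one N x N mask per window
    (N = window_size ** 2), with 0.0 where attention is allowed and `neg`
    where two positions come from different regions after the cyclic shift.
    """
    assert 0 <= shift_size < window_size

    # Pad the feature map size up to a multiple of the window size.
    hp = math.ceil(height / window_size) * window_size
    wp = math.ceil(width / window_size) * window_size

    def band(i, size):
        # 0: unaffected area, 1: last window before the shifted strip,
        # 2: strip that wraps around after the cyclic shift.
        if i < size - window_size:
            return 0
        if i < size - shift_size:
            return 1
        return 2

    # Each position gets one of up to 9 region labels (3 row bands x 3 column bands).
    labels = [[3 * band(r, hp) + band(c, wp) for c in range(wp)] for r in range(hp)]

    masks = []
    for top in range(0, hp, window_size):
        for left in range(0, wp, window_size):
            # Flatten the labels of one window in row-major order.
            flat = [labels[r][c]
                    for r in range(top, top + window_size)
                    for c in range(left, left + window_size)]
            # Positions may attend to each other only if their labels match.
            masks.append([[0.0 if a == b else neg for b in flat] for a in flat])
    return masks


# The mask is recomputed per input size instead of being fixed at construction time.
masks = shifted_window_mask(height=7, width=10, window_size=4, shift_size=2)
print(len(masks), len(masks[0]), len(masks[0][0]))
```

In a real model the same computation would be done with tensors inside the forward pass, and the feature map itself would need the same padding so that the window partition matches the mask.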
Also, FYI, I created a similar issue in SimMIM: https://github.com/microsoft/SimMIM/issues/13. Overall, having a stable version of swin_transformer.py somewhere (maybe even in the main SwinTransformer/Swin-Transformer repo?) that supports dynamic masking would help a lot :)
Thanks!