Introduction
Swin Transformer aims to be a general-purpose backbone for vision tasks. It is a vision transformer with linear computational complexity with respect to image size. Its performance surpasses the previous state of the art by a large margin: +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K.
Method
Patch embedding: map each patch to a linear embedding (e.g. nn.Conv2d(3, emb_dim, kernel_size=patch_size, stride=patch_size)), patch_size: 4x4.
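A minimal sketch of this patch embedding step, assuming a 224x224 RGB input and emb_dim=96 (the Swin-T default); the names patch_embed and tokens are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

# Strided conv maps each non-overlapping 4x4 patch to an emb_dim-dim token.
patch_size, emb_dim = 4, 96
patch_embed = nn.Conv2d(3, emb_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)             # NCHW image
tokens = patch_embed(x)                      # (1, 96, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 56*56, 96) token sequence
```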
Window-based multi-head self-attention (W-MSA) module with relative position bias: Swin Transformer applies self-attention to the patches within each non-overlapping window, window_size: 7x7. Note that a learnable relative position bias is added to the attention logits to provide relative position information inside the window.
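A rough sketch of the window partition and the learnable relative position bias table, assuming window_size=7 and num_heads=3; window_partition, rel_bias_table, and rel_index are hypothetical names for illustration only.

```python
import torch
import torch.nn as nn

window_size, num_heads = 7, 3

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# One learnable bias per relative offset in [-(ws-1), ws-1] along each axis.
rel_bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))

# Precompute the bias-table index for every pair of positions in a window.
coords = torch.stack(torch.meshgrid(
    torch.arange(window_size), torch.arange(window_size), indexing="ij"))
coords = coords.flatten(1)                       # (2, ws*ws)
rel = coords[:, :, None] - coords[:, None, :]    # (2, ws*ws, ws*ws)
rel = rel.permute(1, 2, 0) + (window_size - 1)   # shift offsets to start at 0
rel_index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]

# Bias added to the attention logits of each head: (num_heads, ws*ws, ws*ws)
bias = rel_bias_table[rel_index.view(-1)].view(
    window_size * window_size, window_size * window_size, num_heads).permute(2, 0, 1)
```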
Shifted window attention: since the windows are non-overlapping, Swin Transformer alternates regular and shifted window partitions so that attention is also computed across neighboring windows.
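A minimal sketch of the cyclic shift that realizes shifted windows, assuming a shift of window_size // 2 = 3; the attention mask that prevents tokens from non-adjacent regions attending to each other is omitted here.

```python
import torch

shift = 3                                    # assumed: window_size // 2
x = torch.randn(1, 56, 56, 96)               # (B, H, W, C) feature map

# Roll the feature map so a regular window partition now spans pixels that
# belonged to different windows in the previous layer.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
# ... window attention on `shifted` (with the cross-region mask) ...
x = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))   # undo the shift
```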
Patch merging: a patch-level analogue of (inverse) pixel shuffle that halves the spatial resolution, used to build the hierarchical structure.
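A sketch of one patch merging layer, assuming features are laid out as (B, H, W, C); each 2x2 group of neighboring tokens is concatenated (4C channels) and projected to 2C. The names patch_merge and reduction are illustrative.

```python
import torch
import torch.nn as nn

def patch_merge(x, reduction):
    # x: (B, H, W, C) with even H and W
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
    return reduction(x)                        # (B, H/2, W/2, 2C)

C = 96
reduction = nn.Linear(4 * C, 2 * C, bias=False)
out = patch_merge(torch.randn(1, 56, 56, C), reduction)   # (1, 28, 28, 192)
```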
Apply stochastic depth to each residual block: i.e. stochastically choose whether to bypass the block at training time (torchvision.ops.stochastic_depth).
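A minimal sketch of how torchvision.ops.stochastic_depth could wrap a residual branch; residual and f are hypothetical names and p=0.2 is an assumed drop rate, not a value from the paper.

```python
import torch
from torchvision.ops import stochastic_depth

def residual(x, f, p=0.2, training=True):
    # In "row" mode each sample independently drops the branch output with
    # probability p during training; at eval time the op is the identity.
    return x + stochastic_depth(f(x), p=p, mode="row", training=training)
```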
Highlight
Compared with other transformer- and CNN-based models, Swin Transformer achieves the same accuracy with a smaller model size or fewer operations.
Limitation
Comments