Introduction
Swin Transformer aims to be a general-purpose backbone for vision tasks. It is a vision transformer with linear computational complexity with respect to image size. Its performance surpasses the previous state of the art by a large margin: +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K.
Method
Patch embedding: map each patch to a linear embedding (e.g. nn.Conv2d(3, emb_dim, kernel_size=patch_size, stride=patch_size)), patch_size: 4x4.
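A minimal sketch of this patch embedding step, assuming a 224x224 RGB input and emb_dim=96 (the Swin-T default); the names patch_embed and tokens are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

# Strided conv maps each non-overlapping 4x4 patch to an emb_dim-dim token.
patch_size, emb_dim = 4, 96
patch_embed = nn.Conv2d(3, emb_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)             # NCHW image
tokens = patch_embed(x)                      # (1, 96, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 56*56, 96) token sequence
```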
Window-based multi-head self-attention (W-MSA) module with relative position bias: Swin Transformer applies self-attention to the patches within each non-overlapping window, window_size: 7x7. Note that a learnable relative position bias is added to the attention logits to provide relative position information inside the window.
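A rough sketch of the window partition and the learnable relative position bias table, assuming window_size=7 and num_heads=3; window_partition, rel_bias_table, and rel_index are hypothetical names for illustration only.

```python
import torch
import torch.nn as nn

window_size, num_heads = 7, 3

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# One learnable bias per relative offset in [-(ws-1), ws-1] along each axis.
rel_bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))

# Precompute the bias-table index for every pair of positions in a window.
coords = torch.stack(torch.meshgrid(
    torch.arange(window_size), torch.arange(window_size), indexing="ij"))
coords = coords.flatten(1)                       # (2, ws*ws)
rel = coords[:, :, None] - coords[:, None, :]    # (2, ws*ws, ws*ws)
rel = rel.permute(1, 2, 0) + (window_size - 1)   # shift offsets to start at 0
rel_index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]

# Bias added to the attention logits of each head: (num_heads, ws*ws, ws*ws)
bias = rel_bias_table[rel_index.view(-1)].view(
    window_size * window_size, window_size * window_size, num_heads).permute(2, 0, 1)
```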
Shifted window attention: since the windows are non-overlapping, Swin Transformer alternates regular and shifted window partitions so that attention is also computed across neighboring windows.
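A minimal sketch of the cyclic shift that realizes shifted windows, assuming a shift of window_size // 2 = 3; the attention mask that prevents tokens from non-adjacent regions attending to each other is omitted here.

```python
import torch

shift = 3                                    # assumed: window_size // 2
x = torch.randn(1, 56, 56, 96)               # (B, H, W, C) feature map

# Roll the feature map so a regular window partition now spans pixels that
# belonged to different windows in the previous layer.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
# ... window attention on `shifted` (with the cross-region mask) ...
x = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))   # undo the shift
```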
Patch merging: a patch-level analogue of (inverse) pixel shuffle that halves the spatial resolution, used to build the hierarchical structure.
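A sketch of one patch merging layer, assuming features are laid out as (B, H, W, C); each 2x2 group of neighboring tokens is concatenated (4C channels) and projected to 2C. The names patch_merge and reduction are illustrative.

```python
import torch
import torch.nn as nn

def patch_merge(x, reduction):
    # x: (B, H, W, C) with even H and W
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
    return reduction(x)                        # (B, H/2, W/2, 2C)

C = 96
reduction = nn.Linear(4 * C, 2 * C, bias=False)
out = patch_merge(torch.randn(1, 56, 56, C), reduction)   # (1, 28, 28, 192)
```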
Apply stochastic depth to each residual block: i.e. stochastically choose whether to bypass the block at training time (torchvision.ops.stochastic_depth).
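A minimal sketch of how torchvision.ops.stochastic_depth could wrap a residual branch; residual and f are hypothetical names and p=0.2 is an assumed drop rate, not a value from the paper.

```python
import torch
from torchvision.ops import stochastic_depth

def residual(x, f, p=0.2, training=True):
    # In "row" mode each sample independently drops the branch output with
    # probability p during training; at eval time the op is the identity.
    return x + stochastic_depth(f(x), p=p, mode="row", training=training)
```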
Highlight
Compared with other transformer- and CNN-based models, Swin Transformer achieves the same accuracy with a smaller model size or fewer operations.
Limitation
Comments