1. Public code and paper link: code: https://github.com/microsoft/CSWin-Transformer paper link: https://arxiv.org/abs/2107.00652
2. What's the problem addressed?
A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token.
To address this issue, they develop the Cross-Shaped Window (CSWin) self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel that together form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. The result is a Transformer network that achieves strong modeling capability while limiting the computation cost.
3. What is the heart of this method?
Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions and is thus especially effective and friendly for downstream tasks.
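In the paper, LePE is added on the value path: the attention output becomes softmax(QK^T/√d)V + LePE(V), with LePE realized as a depthwise convolution on V. Below is a minimal PyTorch sketch of this idea; the class name, interface, and the 3x3 kernel size are assumptions of this sketch, not the repo's exact API.

```python
import torch
import torch.nn as nn


class LePEAttention(nn.Module):
    # Attention output plus a depthwise conv over V, i.e.
    # softmax(QK^T / sqrt(d)) V + DWConv(V), so the positional bias is
    # computed from local neighborhoods and imposes no fixed input resolution.
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # Depthwise 3x3 convolution acting as the positional encoding
        # (the 3x3 kernel size is an assumption of this sketch).
        self.lepe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, q, k, v, H, W):
        # q, k, v: (B, heads, N, head_dim) with N == H * W.
        B, h, N, d = q.shape
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = attn @ v  # (B, heads, N, head_dim)
        # LePE branch: view V as a (C, H, W) feature map, apply the
        # depthwise conv, then map back to the per-head token layout.
        v_img = v.transpose(1, 2).reshape(B, N, h * d).transpose(1, 2).reshape(B, h * d, H, W)
        pe = self.lepe(v_img).reshape(B, h * d, N).transpose(1, 2)  # (B, N, C)
        pe = pe.reshape(B, N, h, d).transpose(1, 2)  # (B, heads, N, head_dim)
        return out + pe
```

Because the positional bias comes from a convolution rather than a learned table indexed by absolute or relative positions, nothing in the module depends on a fixed H and W, which is why arbitrary input resolutions work out of the box.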
Performing self-attention in horizontal and vertical stripes in parallel that form a cross-shaped window: they equally split the K heads into two parallel groups (each has K/2 heads; K is typically an even value). The first group of heads performs horizontal-stripe self-attention while the second group performs vertical-stripe self-attention. Finally, the outputs of these two parallel groups are concatenated back together, as in the sketch below.
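To make the stripe partition and the head split concrete, here is a simplified PyTorch sketch. The function names and the reshaping details are my own assumptions; the actual implementation fuses the QKV projections and adds LePE inside each stripe-attention branch.

```python
import torch


def stripe_attention(q, k, v, H, W, sw, horizontal=True):
    # Self-attention restricted to stripes of width sw.
    # q, k, v: (B, heads, H*W, head_dim); assumes H and W are divisible by sw.
    B, h, N, d = q.shape

    def to_stripes(x):
        x = x.reshape(B, h, H, W, d)
        if horizontal:
            # Horizontal stripes: sw rows tall, spanning the full width W.
            return x.reshape(B, h, H // sw, sw, W, d).reshape(B, h, H // sw, sw * W, d)
        # Vertical stripes: sw columns wide, spanning the full height H.
        x = x.permute(0, 1, 3, 2, 4)  # (B, h, W, H, d)
        return x.reshape(B, h, W // sw, sw, H, d).reshape(B, h, W // sw, sw * H, d)

    qs, ks, vs = map(to_stripes, (q, k, v))
    attn = (qs @ ks.transpose(-2, -1)) * d ** -0.5
    out = attn.softmax(dim=-1) @ vs  # attention within each stripe only

    # Undo the stripe partition back to (B, heads, H*W, head_dim).
    if horizontal:
        return out.reshape(B, h, H // sw, sw, W, d).reshape(B, h, H * W, d)
    out = out.reshape(B, h, W // sw, sw, H, d).reshape(B, h, W, H, d)
    return out.permute(0, 1, 3, 2, 4).reshape(B, h, H * W, d)


def cross_shaped_window_attention(q, k, v, H, W, sw):
    # First half of the heads attends within horizontal stripes, the
    # second half within vertical stripes; outputs are concatenated
    # along the head dimension to form the cross-shaped window.
    half = q.shape[1] // 2
    out_h = stripe_attention(q[:, :half], k[:, :half], v[:, :half], H, W, sw, horizontal=True)
    out_v = stripe_attention(q[:, half:], k[:, half:], v[:, half:], H, W, sw, horizontal=False)
    return torch.cat([out_h, out_v], dim=1)  # (B, heads, H*W, head_dim)
```

Each head attends to at most sw * W (or sw * H) tokens rather than all H * W, so the stripe width sw trades computation cost against the size of the attended region, while the two parallel head groups together cover a cross-shaped area in a single block.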
4. Results
5. Can I remember some related works?
6. Related parts