Layers in Swin Transformer

microsoft / Swin-Transformer

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

https://arxiv.org/abs/2103.14030

MIT License


Layers in Swin Transformer #323

Open 920703 opened 1 year ago

920703 commented 1 year ago

I have a question about the layers in Swin Transformer. The Swin-T architecture is described as having 2, 2, 6, 2 layers in stages 1, 2, 3, and 4.
What does it mean to have 2 layers in the 1st stage and 6 layers in the 3rd stage? I understand that there are 2 successive Swin Transformer blocks, but I am confused by the term "layers".

Does it mean that in layer 1 the W-MSA block is executed and its output is passed to the SW-MSA block? What happens next? And what about layer 2: is the W-MSA block executed again on the output of the SW-MSA block?

@zeliu98 @ancientmooner Please help. Others are welcome to share their views as well. Thank you.

920703 commented 1 year ago

Please reply if anyone knows.

@zeliu98 @ancientmooner Please help clarify this.

Regards

solomoneshetie commented 11 months ago

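The architecture has four stages, and each stage is built from Swin Transformer blocks that come in pairs (a W-MSA block followed by an SW-MSA block). In my understanding, the given layer counts indicate how many Swin Transformer blocks are applied in each stage, so 2 means one W-MSA/SW-MSA pair and 6 means three such pairs in a row.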
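A minimal sketch of that reading, assuming each per-stage count is a number of blocks and that even-indexed blocks use regular windows while odd-indexed blocks use windows shifted by half the window size. The function name `block_schedule` and the default `window_size=7` are illustrative choices here, not identifiers taken from the repository.

```python
def block_schedule(depths, window_size=7):
    """Return, per stage, the (attention type, shift size) of each block.

    Hypothetical helper: it only lists the order of blocks, it does not
    build any model.
    """
    schedule = []
    for depth in depths:
        stage = []
        for i in range(depth):
            # Assumption: even-indexed blocks use regular windows (W-MSA),
            # odd-indexed blocks shift the windows by half a window (SW-MSA).
            shift = 0 if i % 2 == 0 else window_size // 2
            stage.append(("W-MSA" if shift == 0 else "SW-MSA", shift))
        schedule.append(stage)
    return schedule


swin_t = block_schedule([2, 2, 6, 2])
for stage_idx, stage in enumerate(swin_t, start=1):
    print(f"stage {stage_idx}: " + " -> ".join(name for name, _ in stage))
```

Under these assumptions, stage 3 of Swin-T is W-MSA -> SW-MSA repeated three times, with each block operating on the output of the previous one.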

mramezani64 commented 6 months ago

Why is the highest number of block repetitions placed in the third stage? Other Swin variants use [2, 2, 18, 2]. Can this logic be generalised to other modalities?

Matagi1996 commented 5 months ago

To enlarge the receptive field, the window partition is shifted in every second block inside a stage; this way, information is gradually exchanged between the windows that the tokens are divided into. This is why the number of blocks per stage is always divisible by 2. The logic for putting most blocks in the 3rd stage is that the earlier stages learn simple, local information before the feature map is downsampled, so global information is not needed yet. Putting most parameters into the 3rd stage worked well empirically and is also seen in ResNets.
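To make the per-stage trade-off concrete, here is a rough sketch of the feature-map size in each stage. It assumes the Swin-T settings from the paper (224x224 input, 4x4 patches, embedding dimension 96) and that the downsampling between stages halves each spatial side and doubles the channels. `stage_shapes` is a hypothetical helper, not part of the repository.

```python
def stage_shapes(img_size=224, patch_size=4, embed_dim=96, depths=(2, 2, 6, 2)):
    """List resolution, token count, and channels for every stage.

    Assumes Swin-T defaults; downsampling happens before stages 2, 3, and 4.
    """
    shapes = []
    resolution = img_size // patch_size  # tokens per side after patch embedding
    channels = embed_dim
    for i, depth in enumerate(depths):
        if i > 0:
            # Assumed downsampling step: half the side length, twice the channels.
            resolution //= 2
            channels *= 2
        shapes.append({
            "stage": i + 1,
            "blocks": depth,
            "resolution": resolution,
            "tokens": resolution * resolution,
            "channels": channels,
        })
    return shapes


shapes = stage_shapes()
for s in shapes:
    print(s)
```

With these assumptions, the stages run at 56x56, 28x28, 14x14, and 7x7 tokens, so stage 3 stacks the most blocks on a map that is already fairly small but still larger than a single 7x7 window.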