Open 920703 opened 1 year ago
Please reply, if anyone knows.
@zeliu98 @ancientmooner Please clear my doubt.
Regards
The architecture has four stages, and each stage is built from Swin Transformer blocks that come in pairs (a W-MSA block followed by an SW-MSA block). In my understanding, the given layer numbers indicate how many Swin Transformer blocks are stacked in each stage.
Why does the number of block repetitions follow the logic of having the highest number of repetitions in the third stage? Other Swin variants follow [2,2,18,2]. Can this logic be generalised to other modalities?
To enlarge the receptive field, the window partition is shifted in every second block inside a stage. This way, information can gradually be exchanged between the windows that the tokens are divided into. This is why the number of blocks per stage is always divisible by 2. The logic for putting most blocks in the 3rd stage is that the earlier stages learn low-level information before the feature map is downsampled, so one doesn't need global information just yet. Putting most parameters into the 3rd stage worked empirically and is also seen in ResNets.
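To make the alternation concrete, here is a minimal sketch of how a depth list such as 2, 2, 6, 2 could be unrolled into individual blocks. It is only an illustration, not the repository's code: `block_schedule` is a hypothetical helper, the window size of 7 is assumed as the default, and the shift of half a window for every second block is my reading of the description above.

```python
def block_schedule(depths, window_size=7):
    """Unroll per-stage depths into (stage, block, attention, shift) tuples.

    Hypothetical helper for illustration only. Assumes even-indexed blocks
    use regular windows (W-MSA, shift 0) and odd-indexed blocks use shifted
    windows (SW-MSA, shift of window_size // 2).
    """
    schedule = []
    for stage, depth in enumerate(depths, start=1):
        if depth % 2 != 0:
            # Blocks come in W-MSA / SW-MSA pairs, so depth must be even.
            raise ValueError(f"stage {stage}: depth {depth} is not divisible by 2")
        for block in range(depth):
            shifted = block % 2 == 1
            attention = "SW-MSA" if shifted else "W-MSA"
            shift = window_size // 2 if shifted else 0
            schedule.append((stage, block, attention, shift))
    return schedule


if __name__ == "__main__":
    # Swin-T style depths as discussed in this thread.
    for stage, block, attention, shift in block_schedule([2, 2, 6, 2]):
        print(f"stage {stage} block {block}: {attention} (shift={shift})")
```

Under these assumptions, "6 layers" in stage 3 would mean three W-MSA / SW-MSA pairs executed one after another, each block taking the previous block's output as its input.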
I had a doubt about the layers in Swin Transformer. The Swin-T architecture is described as having 2, 2, 6, 2 layers at stages 1, 2, 3 and 4.
What does it mean to have 2 layers at the 1st stage and 6 layers at the 3rd stage? I understand that there are 2 successive Swin Transformer blocks, but I am confused by the term "layers".
Does it mean that at layer 1 the W-MSA block is executed and its output is passed to the SW-MSA block? What happens next? What about layer 2: is the W-MSA block executed again on the output of the SW-MSA block?
@zeliu98 @ancientmooner Please help. Others can also give their views. Thank you.