Hi, thanks for your interest in our work.
This modification leads to an accuracy improvement of around 0.1% on ImageNet for the GFNet-H-Ti model. We also found that using a single residual connection in each block stabilizes the training of deeper models, since the total number of residual connections is halved. A similar design is also used in recent work such as ConvNeXt [1].
[1] A ConvNet for the 2020s, https://arxiv.org/abs/2201.03545
Thank you for your quick reply!
Hello, thanks for your great work!
In your figure and code, there is no skip connection after the global filter layer.
This is different from the original Transformer implementation, which has two skip connections in a single block (one around the self-attention layer and one around the FFN layer).
For example, the original Transformer uses a block like the following:
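A minimal PyTorch-style sketch of the standard pre-norm block; `attn`, `mlp`, `norm1`, and `norm2` are placeholder names for the usual sub-modules, not identifiers from the GFNet code.

```python
# Standard pre-norm Transformer block: two residual (skip) connections,
# one around the self-attention layer and one around the FFN/MLP.
def transformer_block(x, attn, mlp, norm1, norm2):
    x = x + attn(norm1(x))  # skip connection around self-attention
    x = x + mlp(norm2(x))   # skip connection around the FFN
    return x
```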
But the Global Filter Network uses the block below:
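A sketch in the same style (names are again placeholders; the real implementation may also apply drop path / stochastic depth, omitted here):

```python
# GFNet block: a single residual connection wrapping both the global
# filter layer and the FFN/MLP, so there is no separate skip
# connection right after the global filter.
def gfnet_block(x, global_filter, mlp, norm1, norm2):
    x = x + mlp(norm2(global_filter(norm1(x))))
    return x
```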
Is there any reason for adopting the current block architecture?
Thanks,