Hi, thanks for your interest in our work.
This modification leads to an accuracy improvement of around 0.1% on ImageNet for the GFNet-H-Ti model. We also found that using a single residual connection in each block stabilizes the training of deeper models, since the total number of residual connections is halved. A similar design is also used in recent work such as ConvNeXt [1].
[1] A ConvNet for the 2020s, https://arxiv.org/abs/2201.03545
Thank you for your quick reply!
Hello, thanks for your great work!
In your figure and code, there is no skip connection after the global filter layer.
This is different from the original Transformer implementation, which has two skip connections in a single block (one around the self-attention layer and one around the FFN layer).
For example, the original Transformer uses a block like the following:
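A minimal PyTorch-style sketch of the standard pre-norm block; `attn`, `mlp`, `norm1`, and `norm2` are placeholder names for the usual sub-modules, not identifiers from the GFNet code.

```python
# Standard pre-norm Transformer block: two residual (skip) connections,
# one around the self-attention layer and one around the FFN/MLP.
def transformer_block(x, attn, mlp, norm1, norm2):
    x = x + attn(norm1(x))  # skip connection around self-attention
    x = x + mlp(norm2(x))   # skip connection around the FFN
    return x
```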
But the Global Filter Network uses the block below:
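A sketch in the same style (names are again placeholders; the real implementation may also apply drop path / stochastic depth, omitted here):

```python
# GFNet block: a single residual connection wrapping both the global
# filter layer and the FFN/MLP, so there is no separate skip
# connection right after the global filter.
def gfnet_block(x, global_filter, mlp, norm1, norm2):
    x = x + mlp(norm2(global_filter(norm1(x))))
    return x
```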
Is there any reason for adopting the current block architecture?
Thanks,