sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)
https://arxiv.org/abs/2111.11418
Apache License 2.0

What makes pooling achieve competitive or even better performance than attention? #43

Closed sudo1609 closed 1 year ago

sudo1609 commented 1 year ago

In this paper, you show that the success of ViT does not come from the attention token mixer but from the general architecture, which you call MetaFormer. Remarkably, simply replacing attention with an extremely simple pooling operator still yields SOTA performance. So the question is: what makes pooling achieve competitive or even better performance than attention?

yuweihao commented 1 year ago

Hi @TheK2NumberOne , thanks for your attention.

| Model | MetaFormer | Token mixing ability | Local inductive bias | Params | MACs | Top-1 Acc |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | No | Strong | More | 26M | 4.1G | 79.8 |
| PoolFormer-S24 | Yes | Weak | More | 21M | 3.4G | 80.3 |
| DeiT-S (Transformer) | Yes | Strong | Less | 22M | 4.6G | 79.8 |
  1. Compared with ResNet-50: the local spatial modeling ability of the pooling layer is much weaker than that of ResNet's convolutions, so PoolFormer's competitive performance can only be attributed to its general architecture, MetaFormer.
  2. Compared with DeiT-S: PoolFormer's better performance may result from the stronger local inductive bias of pooling.
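For reference, the pooling token mixer from the paper is indeed extremely simple. A minimal PyTorch sketch (following the design described in the paper, where the input is subtracted so the mixer only contributes the difference aggregated from neighboring tokens):

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Average-pooling token mixer (sketch of the PoolFormer design).

    Stride 1 with symmetric padding keeps the spatial resolution,
    and subtracting the input means the residual branch models only
    the information gathered from neighboring tokens.
    """
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2,
            count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pool(x) - x: on a constant input this is exactly zero,
        # i.e. the mixer adds nothing when there is nothing to mix.
        return self.pool(x) - x
```

Note that this mixer has no learnable parameters at all, which is why PoolFormer-S24 ends up smaller (21M params) than both ResNet-50 and DeiT-S while still reaching 80.3 top-1.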