In this paper, you confirm that the success of ViT does not come from the attention token mixer but from the general architecture, termed MetaFormer. Remarkably, you only need to replace attention with an extremely simple pooling operator and it still gives SOTA performance. So the question is: what makes pooling achieve performance that is competitive with, or even better than, attention?

Hi @TheK2NumberOne, thanks for your attention.
Model | MetaFormer | Token mixing ability | Local inductive bias | Params | MACs | Top-1 Acc (%) |
---|---|---|---|---|---|---|
ResNet-50 | No | Strong | More | 26M | 4.1G | 79.8 |
PoolFormer-S24 | Yes | Weak | More | 21M | 3.4G | 80.3 |
DeiT-S (Transformer) | Yes | Strong | Less | 22M | 4.6G | 79.8 |

As the table shows, the pooling mixer in PoolFormer-S24 has weak token mixing ability but brings more local inductive bias, which is likely why it matches or surpasses DeiT-S and ResNet-50 with fewer parameters and MACs.