sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)
https://arxiv.org/abs/2111.11418
Apache License 2.0

Can I say PoolFormer is just a non-trainable MLP-like module? #10


072jiajia commented 2 years ago

Hi! Thanks for sharing the great work! I have some questions about PoolFormer. If I explain PoolFormer as in the attachments below, can I say that PoolFormer is just a non-trainable MLP-like model?

[two attached diagrams, not reproduced here]
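In short, the argument in the attachments is that average pooling over tokens is just multiplication by a fixed, non-trainable token-mixing matrix. A toy sketch of this idea (illustrative code, not the repo's implementation):

```python
# Toy sketch (illustrative, not the repo's code): 2D average pooling over tokens
# equals multiplying the flattened token vector by a fixed, non-trainable mixing
# matrix, i.e. an "MLP-like" layer whose weights are never learned.
import torch
import torch.nn.functional as F

H = W = 4                      # toy spatial resolution
N = H * W                      # number of tokens
x = torch.randn(1, 1, H, W)    # one channel for simplicity

pooled = F.avg_pool2d(x, 3, stride=1, padding=1, count_include_pad=False)

# Build the equivalent fixed mixing matrix M (N x N): column i is the pooling
# response to a one-hot token, so pooling is exactly x -> M @ x for flattened x.
M = torch.zeros(N, N)
for i in range(N):
    e = torch.zeros(1, 1, H, W)
    e.view(-1)[i] = 1.0
    M[:, i] = F.avg_pool2d(e, 3, stride=1, padding=1, count_include_pad=False).reshape(-1)

pooled_via_matmul = (M @ x.reshape(N, 1)).reshape(1, 1, H, W)
print(torch.allclose(pooled, pooled_via_matmul, atol=1e-6))  # True
```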

yuweihao commented 2 years ago

Hi @072jiajia ,

Sure. There are a thousand Hamlets in a thousand people's eyes. Feel free to explain PoolFormer from different aspects.

072jiajia commented 2 years ago

Thanks for replying! Assuming the above statements are true, I'm curious how your model's performance can be better than ResNet's, because it would then just be a ResNet with some layers whose weights are not trainable. Or did I miss any part of the model?

yuweihao commented 2 years ago

Hi @072jiajia , in this paper we claim that MetaFormer is actually what you need for vision. The competitive performance of PoolFormer stems from MetaFormer. You can see that the MetaFormer architecture is different from the ResNet architecture. If the token mixer in MetaFormer is specified as just a simple learnable depthwise convolution, performance better than PoolFormer's will be obtained. This can be implemented by replacing `self.token_mixer = Pooling(pool_size=pool_size)` in the code with `self.token_mixer = nn.Conv2d(in_channels=dim, out_channels=dim, kernel_size=3, stride=1, padding=1, groups=dim)`.
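For clarity, here is a minimal sketch of the two token mixers side by side (illustrative code, not the exact implementation in this repo):

```python
# Minimal sketch (illustrative, not the exact repo code): the pooling token mixer
# and a learnable depthwise convolution are drop-in replacements for each other,
# both mapping (B, C, H, W) -> (B, C, H, W) while mixing information across tokens.
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer: average pooling minus the identity (no learnable weights)."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):
        return self.pool(x) - x

dim = 64
pool_mixer = Pooling(pool_size=3)
dwconv_mixer = nn.Conv2d(in_channels=dim, out_channels=dim,
                         kernel_size=3, stride=1, padding=1, groups=dim)

x = torch.randn(2, dim, 14, 14)
print(pool_mixer(x).shape, dwconv_mixer(x).shape)  # both torch.Size([2, 64, 14, 14])
```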

xingshulicc commented 2 years ago

I totally agree with @072jiajia. Honestly, I am not sure about the goal of this paper. The general architecture of MetaFormer is almost the same as that of vision transformers; the only difference is the token-mixer operator. By the way, your proposed pooling operation is a kind of non-trainable convolution, while the patch embedding mentioned in the paper is a trainable convolution whose kernel size equals its stride and whose output channels equal the embedding dimension. Therefore, I think your model is a variant of the convolutional neural network; it is very similar to the Network in Network (NIN) model (proposed in 2013 by Lin et al.): the two linear layers can be implemented with two 1 x 1 convolutional layers.

yuweihao commented 2 years ago

Hi @xingshulicc ,

Thanks for your attention. The goal of this paper is not to propose novel models. The idea of this paper is to propose a hypothesis and come up with methods to verify it.

Hypothesis: Instead of the specific token mixer, the general architecture, termed MetaFormer, is more essential for the model to achieve competitive performance.

Verification: We specify the token mixer as an extremely simple operator, pooling, and find the derived model PoolFormer outperforms well-tuned Vision Transformer/MLP-like/ResNet baselines.

MetaFormer is not a specific model but an abstract, general architecture: by regarding the attention module as a specific token mixer, MetaFormer is abstracted from the Transformer with the token mixer left unspecified. By contrast, PoolFormer is a specific model obtained by specifying the token mixer in MetaFormer as pooling. PoolFormer is utilized as a tool to verify the hypothesis.

Besides PoolFormer, we also came up with other ways to verify the hypothesis, such as specifying the token mixer as a random matrix or a depthwise convolution (see the ablation study table in the paper). Since pooling is the simplest, we finally chose it as the default tool to verify the hypothesis.
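For reference, a frozen random-matrix token mixer could look roughly like this (a rough sketch only; the exact formulation used in the paper's ablation may differ):

```python
# Rough sketch of a "random matrix" token mixer: the token-mixing matrix is
# randomly initialized and then frozen, so it receives no gradient updates.
# (Softmax normalization is used here only to keep the output scale reasonable;
# the exact formulation in the paper's ablation may differ.)
import torch
import torch.nn as nn

class RandomMatrixTokenMixer(nn.Module):
    def __init__(self, num_tokens: int):
        super().__init__()
        mix = torch.softmax(torch.randn(num_tokens, num_tokens), dim=-1)
        self.register_buffer("mix", mix)   # buffer, not a Parameter: never trained

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2)              # (B, C, N)
        mixed = tokens @ self.mix.t()      # each output token is a fixed mix of all inputs
        return mixed.reshape(b, c, h, w)

mixer = RandomMatrixTokenMixer(num_tokens=14 * 14)
x = torch.randn(2, 64, 14, 14)
print(mixer(x).shape)  # torch.Size([2, 64, 14, 14])
```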

xingshulicc commented 2 years ago

Hi, thank you for your reply. Based on your response, can I say that the most important part of MetaFormer is the general architecture abstracted from the Transformer, as shown in Figure 1 of the paper? If so, how can I further improve MetaFormer's performance in future work? Maybe the best way is to enhance the token-mixer part, as the ablation study (Table 5) suggests. I am not questioning the contribution of this paper; I just feel that the article conflicts with my current research views. Of course, your article is very solid.

yuweihao commented 2 years ago

Hi @xingshulicc ,

Yes, the essential part of our paper is the MetaFormer hypothesis. To improve the general architecture MetaFormer, maybe we can:

1) Instead of the abstracted token mixer, improve another component or even the whole architecture of MetaFormer. For example, propose a new normalization that can steadily improve MetaFormer-like baselines (e.g. Transformer, MLP-like, or PoolFormer models).
2) Propose a more effective/efficient optimizer that trains MetaFormer-like models better/faster, outperforming the most commonly used AdamW.
3) and so on ...

We also want to clarify that the MetaFormer hypothesis does not mean the token mixer is insignificant; MetaFormer still has this abstracted component. Rather, it means the token mixer is not limited to a specific type such as attention ("MetaFormer is actually what you need" vs. "Attention is all you need"). It makes sense that specifying a better token mixer in MetaFormer brings better performance (e.g. pooling vs. depthwise convolution in ablation study Table 5). When designing a new token mixer, it is recommended to adopt MetaFormer as the general architecture, since it guarantees competitive performance, i.e. a high lower bound (see the ablation study that replaces pooling with identity mapping or a random matrix). Since many current papers focus on the token mixer, we hope this paper can inspire more future research devoted to improving the fundamental architecture, MetaFormer.
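As a concrete illustration of the abstraction, a MetaFormer-style block with a pluggable token mixer can be sketched as follows (names, the normalization choice, and defaults here are illustrative, not the exact code of this repo):

```python
# Minimal sketch of a MetaFormer block: the token mixer is an arbitrary pluggable
# module, while the rest (norm, channel MLP, residual connections) stays fixed.
# Names, the normalization choice, and defaults are illustrative.
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)        # channel-wise norm for (B, C, H, W) inputs
        self.token_mixer = token_mixer           # attention, pooling, DWConv, identity, ...
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(                # channel MLP as two 1x1 convolutions
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))  # token-mixing sub-block
        x = x + self.mlp(self.norm2(x))          # channel-mixing sub-block
        return x

dim = 64
# Specifying the token mixer yields concrete models
# (the repo's pooling mixer additionally subtracts the input from the pooled output):
pool_block = MetaFormerBlock(dim, nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False))
conv_block = MetaFormerBlock(dim, nn.Conv2d(dim, dim, 3, padding=1, groups=dim))
identity_block = MetaFormerBlock(dim, nn.Identity())  # ablation: no token mixing at all

x = torch.randn(2, dim, 14, 14)
print(pool_block(x).shape)  # torch.Size([2, 64, 14, 14])
```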

xingshulicc commented 2 years ago

Hi, thank you for your reply. I agree with some of your opinions. However, in the paper I did not see substantial modifications to the MetaFormer general architecture; the ablation study just compares the performance of different components (Table 5). Furthermore, I can see that the token-mixer part contributes the most to the performance improvement. Of course, your paper really inspired me a lot, and I also hope to come up with a simplified MetaFormer architecture in my future work. Thank you again.