sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)
https://arxiv.org/abs/2111.11418
Apache License 2.0

Design on positional embedding? #39

Closed jizongFox closed 2 years ago

jizongFox commented 2 years ago

Hello authors,

I really appreciate your work, which has inspired the community. After checking the code and the architecture design, I would like to raise a simple, quick question.

I observed that when the network uses pooling, MLP, or identity as the token mixer, you do not include positional embedding; you only add this component when using MHA. What is the reasoning behind this design, and why do the other models not rely on this embedding?

Best,

yuweihao commented 2 years ago

Hi @jizongFox ,

Thanks for your attention. Similar to convolution, pooling is a local operator that only mixes nearby tokens, so positional embedding is not needed. I once added positional embedding to PoolFormer-S12 and did not observe significant improvement.
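
For reference, a minimal PyTorch sketch of a pooling token mixer in the spirit of PoolFormer's (assuming a channel-first `(B, C, H, W)` feature layout; subtracting the input keeps only the mixing residual):

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Pooling token mixer: average over a local window, then subtract
    the input so the module outputs only the token-mixing residual."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2,
            count_include_pad=False)

    def forward(self, x):  # x: (B, C, H, W)
        # Like convolution, this only mixes nearby tokens, so each
        # token's spatial location is implicit in the operator itself.
        return self.pool(x) - x

# usage: y = Pooling()(torch.randn(1, 64, 14, 14))
```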

For MLP, $X' = WX$, where $X' \in \mathbb{R}^{N\times C}$ and $X \in \mathbb{R}^{N\times C}$ are the output and input features with token length $N$ and channel number $C$, and $W \in \mathbb{R}^{N\times N}$ is a matrix of learnable parameters. Each position $i \in \{0, 1, 2, \dots, N-1\}$ has its own corresponding parameters $W_{i,:} \in \mathbb{R}^{1 \times N}$, so positional information is already encoded in the weights. Thus, positional embedding is not needed for MLP models.
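
To make this concrete, here is a minimal sketch of such a token-mixing layer (the class name `TokenMixingMLP` is illustrative; it assumes a fixed token length $N$ and a `(B, N, C)` input layout, as in MLP-Mixer-style models):

```python
import torch
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    """Token-mixing layer X' = WX: one learnable weight row per token
    position, so no explicit positional embedding is required."""
    def __init__(self, num_tokens):
        super().__init__()
        # W in R^{N x N}; row i holds the position-specific weights W_{i,:}
        self.W = nn.Linear(num_tokens, num_tokens, bias=False)

    def forward(self, x):  # x: (B, N, C)
        # Mix across the token dimension, not the channel dimension.
        return self.W(x.transpose(1, 2)).transpose(1, 2)
```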

jizongFox commented 2 years ago

Hello Yu,

Your answer convinces me, thank you. However, for MLP, the network may not be able to recognize the positional relationship along the x and y axes if positional embedding is not included; it can only capture relationships over the flattened vector. Does this make sense?

yuweihao commented 2 years ago

Hi @jizongFox ,

In my opinion, MLP models can recognize 2D structure by training on a large amount of images. For position $i \in \{0, 1, \dots, N-1\}$ with corresponding learnable parameters $W_{i,:} \in \mathbb{R}^{1\times N}$, reshape them as $M^i = \mathrm{Reshape}(W_{i,:}) \in \mathbb{R}^{H \times W}$, where $H$ is the height, $W$ is the width, and $N = HW$. If we visualize $M^i$, it can be seen that $M^i_{j,k}$ (with $j = \lfloor i / W \rfloor$ and $k = i \bmod W$) and its nearby positions $M^i_{p,q}$ ($p \in \{j-1, j, j+1\}$, $q \in \{k-1, k, k+1\}$, where they exist) have large scores. You can refer to the visualization examples in Figure 2 and Figure B.1 of the ResMLP paper.
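
A hypothetical sketch of that visualization (the function name `show_mixing_weights` and arguments are illustrative; `W` is the trained $N \times N$ token-mixing matrix from the layer above):

```python
import matplotlib.pyplot as plt

def show_mixing_weights(W, i, H, Wd):
    """Reshape the i-th row of the N x N token-mixing matrix into an
    H x Wd map and display it; in a trained model, large values should
    cluster around position (i // Wd, i % Wd)."""
    M_i = W[i].detach().reshape(H, Wd).cpu().numpy()
    plt.imshow(M_i)
    plt.title(f"Mixing weights for token {i} at ({i // Wd}, {i % Wd})")
    plt.colorbar()
    plt.show()
```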

jizongFox commented 2 years ago

Thank you for your reply. I will carefully read these works and reopen the issue if my question persists.