Hi @jizongFox ,
Thanks for your attention. Similar to convolution, pooling is a local operator that only mixes nearby tokens, so positional embedding is not needed. I tried adding positional embedding to PoolFormer-S12 and could not observe a significant improvement.
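For reference, here is a minimal sketch of a pooling token mixer in PyTorch. It is simplified and illustrative rather than the exact code in this repo; the `pool_size` default and the subtraction of the input (so that the block's residual connection adds it back) follow the description in the PoolFormer paper.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Token mixer that only aggregates a local neighborhood, like convolution."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False
        )

    def forward(self, x):  # x: (B, C, H, W)
        # Subtract the input so the mixer models only the local difference;
        # the surrounding block's residual connection adds x back.
        return self.pool(x) - x

x = torch.randn(1, 64, 14, 14)
y = Pooling()(x)  # same shape as x
```

Because every output location is computed from the same fixed local neighborhood, relative position is built into the operator itself, so no extra positional embedding is required.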
For MLP, $X' = WX$, where $X'\in \mathcal{R}^{N\times C}$ and $X \in \mathcal{R}^{N\times C}$ are the output and input features with token length $N$ and channel number $C$, and $W \in \mathcal{R}^{N\times N}$ are learnable parameters. Each position $i \in [0, 1, 2, ..., N-1]$ has its own corresponding parameters $W_{i, :} \in \mathcal{R}^{1 \times N}$, so positional information is already encoded in the weights themselves. Thus, the positional embedding is not needed for MLP models.
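To make this concrete, below is a minimal sketch of such a token-mixing MLP. The class name `TokenMixingMLP` and the shapes are illustrative assumptions, not the exact layer used in this repo; the point is that output position $i$ has its own row of weights $W_{i,:}$.

```python
import torch
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    """Mixes tokens along the spatial dimension: X' = W X, with W in R^{N x N}."""
    def __init__(self, num_tokens: int):
        super().__init__()
        # One learnable row W_{i,:} per output position i, so position
        # information is baked into the parameters themselves.
        self.mix = nn.Linear(num_tokens, num_tokens, bias=False)

    def forward(self, x):  # x: (B, N, C)
        # Apply the N x N mixing matrix over the token axis.
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(2, 196, 384)   # B=2, N=14*14 tokens, C=384 channels
y = TokenMixingMLP(196)(x)     # (2, 196, 384)
```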
Hello Yu,
I am convinced by your answer, thank you a lot. However, for MLP, the network may not be able to recognize the positional relationship along the x and y axes if the positional embedding is not included; it can only retain relationships over the flattened vectors. Does this make sense?
Hi @jizongFox ,
In my opinion, MLP models can recognize the 2D structure by training on a large amount of images. For position $i \in [0, 1, 2, ..., N-1]$, take the corresponding learnable parameters $W_{i, :} \in \mathcal{R}^{1\times N}$ and reshape them as $M^i = \mathrm{Reshape}(W_{i, :}) \in \mathcal{R}^{H \times W}$, where $H$ is the height, $W$ is the width, and $N = HW$. If we visualize $M^i$, it can be seen that $M^i_{j, k}$ (with $j = \lfloor i / W \rfloor$, $k = i \bmod W$) and its nearby positions $M^i_{p, q}$ ($p \in \{j-1, j, j+1\}$, $q \in \{k-1, k, k+1\}$, where they exist) have large scores. You can refer to the visualization examples in Figure 2 and Figure B.1 of the ResMLP paper.
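A rough way to reproduce this kind of visualization is sketched below. Here `W_mix` is just a stand-in for a trained $N \times N$ token-mixing matrix (e.g. taken from a trained ResMLP or MLP-Mixer layer); with random weights the plot will of course show no locality pattern.

```python
import torch
import matplotlib.pyplot as plt

H, W_grid = 14, 14                 # token grid; N = H * W_grid
N = H * W_grid
W_mix = torch.randn(N, N)          # stand-in for a *trained* N x N token-mixing matrix

j, k = 5, 7                        # 2D position of interest
i = j * W_grid + k                 # flattened token index
M_i = W_mix[i].reshape(H, W_grid)  # M^i = Reshape(W_{i,:}) in R^{H x W}

plt.imshow(M_i, cmap="viridis")
plt.title(f"Mixing weights W_{{i,:}} reshaped to the token grid (j={j}, k={k})")
plt.colorbar()
plt.show()
```

For a trained model, the entries around $(j, k)$ tend to be the largest, which is the locality pattern shown in the ResMLP figures referenced above.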
Thank you for your reply. I will carefully read these works and reopen the issue if my question persists.
Hello authors,
I appreciate your current work a lot; it has inspired the community. I would like to raise a very simple and quick question after checking the code and the architecture design.
I observed that for networks using pooling, MLP, or identity as the token mixer, you do not include positional embedding, and you only add this component when using MHA. What is the reasoning behind this design, and why do the other models not rely on this embedding?
Best,