sail-sg / metaformer

MetaFormer Baselines for Vision (TPAMI 2024)
https://arxiv.org/abs/2210.13452
Apache License 2.0

About downsampling(patch_embed) for first stage #4

Closed DoranLyong closed 2 years ago

DoranLyong commented 2 years ago

Hi, thanks for your other interesting work after PoolFormer :)

Question1

I'm curious about the options chosen for the downsampling layers. The comment says kernel_size:=4, stride:=4, and padding:=2, but the values are actually set as [7, 4, 2].

If kernel_size is 4, the stride and padding should be set as [4, 1]. So did you actually run the experiments with the kernel size set to 7?

In your previous PoolFormer, the first-stage downsampling is given as [7, 4, 2] for kernel_size, stride, and padding.
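As a quick sanity check (a minimal sketch of my own, not code from the repository), the standard convolution output-size formula shows why both a [7, 4, 2] and a [4, 4, 1] configuration map a 224×224 input to 56×56:

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a standard convolution (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

# First-stage downsampling on a 224x224 input:
print(conv_out(224, kernel=7, stride=4, padding=2))  # 56 (PoolFormer-style stem)
print(conv_out(224, kernel=4, stride=4, padding=1))  # 56 (patch-embed style)
```

With stride 4, any kernel/padding pair that keeps the formula at 56 is valid; the larger kernel simply makes adjacent patches overlap.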


Question2

Another question: I have been reviewing other MetaFormer-like models (following the 4-stage architecture), and I found many different options that all produce the same feature map sizes per stage: [56, 28, 14, 7].

For example, Uniformer uses kernel_size:=4 for the first stage and kernel_size:=2 for the remaining stages.

For the downsampling stream, do you think these different kernel sizes have any critical effect?
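The different kernel choices coexist because they all satisfy the same output-size arithmetic; only the overlap between patches changes. A small sketch (the stride/padding values below are illustrative assumptions, not taken from any particular repo):

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a standard convolution (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

def pyramid(kernels, strides, paddings, size=224):
    """Feature map sizes after each of the four downsampling layers."""
    sizes = []
    for k, s, p in zip(kernels, strides, paddings):
        size = conv_out(size, k, s, p)
        sizes.append(size)
    return sizes

# Overlapping style: k=7 stem (s=4, p=2), then k=3 (s=2, p=1)
print(pyramid([7, 3, 3, 3], [4, 2, 2, 2], [2, 1, 1, 1]))  # [56, 28, 14, 7]
# Non-overlapping patch-embed style: k=4 stem, then k=2, no padding
print(pyramid([4, 2, 2, 2], [4, 2, 2, 2], [0, 0, 0, 0]))  # [56, 28, 14, 7]
```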

yuweihao commented 2 years ago

Hi @DoranLyong ,

Many thanks for your attention.

Q1: It is a typo. The kernel size of the first downsampling is 7. I have fixed it.

Q2: In my experiments, kernel_size:=7 for the first stage and kernel_size:=3 for the remaining stages perform better but bring more parameters and MACs. I reproduced ConvNeXt-Tiny (82.0%, 28.6M, 4.5G) vs ConvNeXt-Tiny-Downsampling-k7-k3 (82.3%, 30.5M, 4.7G).
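For rough intuition about where the extra parameters come from, here is back-of-the-envelope arithmetic over the four downsampling convolutions alone, assuming ConvNeXt-Tiny's channel dims [96, 192, 384, 768] (this comparison is my sketch, not the author's measurement):

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (weights + bias)."""
    return c_in * c_out * k * k + c_out

dims = [3, 96, 192, 384, 768]  # RGB input, then ConvNeXt-Tiny stage widths
for name, kernels in [("k4/k2", [4, 2, 2, 2]), ("k7/k3", [7, 3, 3, 3])]:
    total = sum(conv_params(ci, co, k)
                for ci, co, k in zip(dims, dims[1:], kernels))
    print(name, total)  # k4/k2: 1554336, k7/k3: 3499200
```

The gap of roughly 1.9M weights is consistent with the 28.6M vs 30.5M totals reported above.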

DoranLyong commented 2 years ago

@yuweihao Thanks. That's a clear result :)

How about using spatially separable convolutions to reduce parameters in the downsampling stream?

Could it lead to worse performance?

yuweihao commented 2 years ago

@DoranLyong ,

Since I have not run experiments with this configuration, I'm sorry that I cannot offer data points or a summary. I remember MobileNetV2 also uses separable convolutions for downsampling, so I guess it may work for the last three downsamplings.
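For a rough sense of the potential savings (a sketch with an assumed intermediate width `c_mid`, not an implementation from the paper), a k×k conv can be factored into a k×1 conv followed by a 1×k conv:

```python
def full_params(c_in, c_out, k):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def spatial_sep_params(c_in, c_mid, c_out, k):
    """Weights of a k x 1 conv (c_in -> c_mid) then a 1 x k conv (c_mid -> c_out)."""
    return k * c_in * c_mid + k * c_mid * c_out

# Second downsampling with k=3, channels 96 -> 192 (assumed dims):
print(full_params(96, 192, 3))             # 165888
print(spatial_sep_params(96, 96, 192, 3))  # 82944, half the weights
```

The savings depend on `c_mid`: with `c_mid = c_in` the weight count halves here, but the first downsampling has only 3 input channels, so a correspondingly narrow `c_mid` would likely bottleneck it, which fits the guess that separable kernels suit the last three downsamplings better than the stem.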


DoranLyong commented 2 years ago

@yuweihao Great! That was a good insight for me :)