Hi @DoranLyong,
Many thanks for your attention.
Q1: It is a typo. The kernel size of the first downsampling is 7. I have fixed it.
Q2: From my experiments, `kernel_size=7` for the first stage and `kernel_size=3` for the remaining stages perform better, but bring more parameters and MACs. In my reproduction: ConvNeXt-Tiny (82.0%, 28.6M, 4.5G) vs. ConvNeXt-Tiny-Downsampling-k7-k3 (82.3%, 30.5M, 4.7G).
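For concreteness, here is a minimal PyTorch sketch of the k7/k3 downsampling schedule described above (channel widths follow ConvNeXt-Tiny for illustration; this is a sketch, not code from the repo):

```python
import torch
import torch.nn as nn

# Illustrative ConvNeXt-Tiny-style channel widths.
dims = [3, 96, 192, 384, 768]

# Overlapping downsamplings: kernel_size=7 (stride=4, padding=2) for the
# first stage, kernel_size=3 (stride=2, padding=1) for the remaining three.
downsample_k7_k3 = nn.ModuleList([
    nn.Conv2d(dims[0], dims[1], kernel_size=7, stride=4, padding=2),
    nn.Conv2d(dims[1], dims[2], kernel_size=3, stride=2, padding=1),
    nn.Conv2d(dims[2], dims[3], kernel_size=3, stride=2, padding=1),
    nn.Conv2d(dims[3], dims[4], kernel_size=3, stride=2, padding=1),
])

x = torch.randn(1, 3, 224, 224)
for layer in downsample_k7_k3:
    x = layer(x)
    print(tuple(x.shape[-2:]))  # (56, 56) -> (28, 28) -> (14, 14) -> (7, 7)
```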
@yuweihao Thanks. That's a clear result :)
How about using spatially separable convolutions to reduce parameters in the downsampling stream?
Could it lead to worse performance?
@DoranLyong,
Since I have not run experiments with this config, I am sorry that I cannot offer data points or a summary. I remember that MobileNetV2 also uses separable convolutions for downsampling, so I guess it may work for the last three downsamplings.
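To make the idea concrete, here is one possible MobileNetV2-style depthwise-separable downsampling (one way to read "separable" here); a hypothetical sketch, not a configuration tested in this repo:

```python
import torch.nn as nn

def separable_downsample(in_ch: int, out_ch: int,
                         kernel_size: int = 3, stride: int = 2) -> nn.Sequential:
    """Hypothetical depthwise-separable downsampling: a strided depthwise
    conv does the spatial reduction, then a 1x1 pointwise conv mixes
    channels. Parameters drop from k*k*in_ch*out_ch (dense conv) to
    k*k*in_ch + in_ch*out_ch."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=in_ch),  # depthwise, strided
        nn.Conv2d(in_ch, out_ch, kernel_size=1),            # pointwise
    )

# e.g., a stage-2 downsampling that maps 56x56 -> 28x28:
# layer = separable_downsample(96, 192)
```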
@yuweihao Great! That was a good insight for me :)
Hi, thanks for another interesting work after PoolFormer :)
Question 1
I am curious about the options you set for the downsampling layers. Your comment seems to say `kernel_size=4`, `stride=4`, and `padding=2`, but the code actually sets [7, 4, 2]. If `kernel_size=4`, stride and padding should be set as [4, 1]. Did you run the experiments with the kernel size set to 7? In your previous PoolFormer, the first-stage downsampling uses [7, 4, 2] for kernel_size, stride, and padding.
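As a quick sanity check, both settings map a 224-pixel input to 56 under the standard Conv2d output-size formula (a small illustrative snippet):

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    # Standard Conv2d output size: floor((size + 2*p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(224, kernel=7, stride=4, padding=2))  # 56
print(conv_out(224, kernel=4, stride=4, padding=1))  # 56
```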
Question 2
I have been reviewing other MetaFormer-like models (following the 4-stage architecture), and I found many different options that produce the same feature-map size for each stage: [56, 28, 14, 7].
For example, Uniformer uses `kernel_size=4` for the first stage and `kernel_size=2` for the remaining stages. In the downsampling stream, do you think their different kernel sizes have some critical effect?
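For reference, both schedules reach the same [56, 28, 14, 7] resolutions; what differs is whether the downsampling windows overlap. A small illustrative check (the helper and schedules below are my own, not from either repo):

```python
def conv_out(size, k, s, p):
    # Standard Conv2d output size: floor((size + 2*p - k) / s) + 1
    return (size + 2 * p - k) // s + 1

# Non-overlapping, Uniformer-style: k=4 (s=4, p=0), then k=2 (s=2, p=0).
sizes = [224]
for k, s, p in [(4, 4, 0), (2, 2, 0), (2, 2, 0), (2, 2, 0)]:
    sizes.append(conv_out(sizes[-1], k, s, p))
print(sizes[1:])  # [56, 28, 14, 7]

# Overlapping, PoolFormer-style: k=7 (s=4, p=2), then k=3 (s=2, p=1).
sizes = [224]
for k, s, p in [(7, 4, 2), (3, 2, 1), (3, 2, 1), (3, 2, 1)]:
    sizes.append(conv_out(sizes[-1], k, s, p))
print(sizes[1:])  # [56, 28, 14, 7]
```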