Closed yuhua666 closed 7 months ago
I have found it. May I ask how this window size affects performance?
Hi @yuhua666,
According to our experiments on LocalVim, using only the 2x2 scan performs better than using only the 7x7 scan, as a 7x7 window may be too large to capture small details in 14x14 tokens. Besides, a combination of 2x2 and 7x7 found through search achieves better performance.
I think the choice between 2x2 and 7x7 may vary depending on your task and dataset. You may try both of them, or some other sizes, to see which one performs better.
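To make the difference between the two window sizes concrete, here is a minimal sketch of the order in which a windowed scan visits the tokens of a 14x14 grid. It is only an illustration, not the repository's implementation: the function name `local_scan_order` is hypothetical, and it assumes a row-major traversal both across windows and inside each window.

```python
def local_scan_order(height, width, window):
    """Return flat token indices in the order a windowed scan visits them.

    Hypothetical helper for illustration only. Windows are visited in
    row-major order, and tokens inside each window are also visited in
    row-major order.
    """
    if height % window or width % window:
        raise ValueError("this sketch assumes the grid divides evenly by the window size")
    order = []
    for win_row in range(0, height, window):          # top edge of each window
        for win_col in range(0, width, window):       # left edge of each window
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)       # flat index in the original grid
    return order


order_2 = local_scan_order(14, 14, 2)
order_7 = local_scan_order(14, 14, 7)

print(order_2[:8])  # [0, 1, 14, 15, 2, 3, 16, 17]
print(order_7[:8])  # [0, 1, 2, 3, 4, 5, 6, 14]
```

With a 2x2 window, vertically adjacent tokens such as 0 and 14 end up two steps apart in the sequence, while with a 7x7 window they are seven steps apart, which is one way to see why the larger window keeps fewer fine-grained neighbors close together.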
Have you tried other sizes such as 3x3 or 5x5? If these sizes are not suitable, why?
The block resolution of Vim with 224x224 input is 14x14, and the block resolutions for VMamba are 56x56, 28x28, 14x14, and 7x7.
Using other sizes on these resolutions requires padding the image, which may slightly increase the computation cost. That is the reason we did not try other sizes. However, our code supports other sizes and will automatically pad the tensor if the height or width is not evenly divisible by the window size, so you can try other sizes if you are interested.
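As a rough check of how much padding other window sizes would need, the arithmetic can be sketched as below. This only computes the padded grid size and the resulting increase in token count; the helper names `padded_size` and `padding_overhead` are hypothetical and are not functions from the repository.

```python
import math


def padded_size(height, width, window):
    """Smallest (height, width) that is divisible by the window size."""
    padded_h = math.ceil(height / window) * window
    padded_w = math.ceil(width / window) * window
    return padded_h, padded_w


def padding_overhead(height, width, window):
    """Ratio of padded token count to original token count (1.0 means no padding)."""
    padded_h, padded_w = padded_size(height, width, window)
    return (padded_h * padded_w) / (height * width)


# Block resolutions mentioned above: 14x14 for Vim; 56, 28, 14, 7 for VMamba.
for resolution in (56, 28, 14, 7):
    for window in (2, 3, 5, 7):
        size = padded_size(resolution, resolution, window)
        ratio = padding_overhead(resolution, resolution, window)
        print(f"{resolution}x{resolution}, window {window}: padded to {size}, x{ratio:.3f} tokens")
```

For example, a 14x14 grid needs no padding for window sizes 2 and 7, but must be padded to 15x15 for window sizes 3 and 5.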
In other people's questions, it was mentioned that the window sizes used by the model are 2 and 7. Where can I view this parameter, and how do I adjust the corresponding code?