hunto / LocalMamba

Code for paper LocalMamba: Visual State Space Model with Windowed Selective Scan
Apache License 2.0

Where can I adjust the window size #7

Closed yuhua666 closed 7 months ago

yuhua666 commented 7 months ago

Other issues mention that the model uses window sizes of 2 and 7. Where can I view this parameter, and how do I adjust it in the code?

yuhua666 commented 7 months ago

I have found it. May I ask how this window size affects performance?

hunto commented 7 months ago

Hi @yuhua666 ,

According to our experiments on LocalVim, using only the 2x2 scan performs better than using only the 7x7 scan, as a 7x7 window may be too large to capture small details in 14x14 tokens. Besides, a combination of 2x2 and 7x7 scans found through search performs better still.

I think the choice between 2x2 and 7x7 may vary with your task and dataset. You could try both, or some other sizes, to see which one performs better.
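
To make the difference between the two window sizes concrete, here is a minimal sketch of the order in which a windowed scan would visit a 14x14 token grid. It is an illustration only, not the repository's implementation: the function name `local_scan_order` is hypothetical, and the row-major traversal inside each window is an assumption.

```python
def local_scan_order(height, width, window):
    """Return flat token indices, visited one window at a time.

    Hypothetical helper for illustration; assumes row-major order both
    across windows and inside each window.
    """
    order = []
    for win_row in range(0, height, window):
        for win_col in range(0, width, window):
            # Visit every token inside the current window before moving on.
            for r in range(win_row, min(win_row + window, height)):
                for c in range(win_col, min(win_col + window, width)):
                    order.append(r * width + c)
    return order


order_2 = local_scan_order(14, 14, 2)
order_7 = local_scan_order(14, 14, 7)
print(order_2[:8])  # [0, 1, 14, 15, 2, 3, 16, 17]
print(order_7[:8])  # [0, 1, 2, 3, 4, 5, 6, 14]
```

With a 2x2 window, tokens that are vertical neighbours (0 and 14) sit two steps apart in the sequence; with a 7x7 window they are seven steps apart, which is one way to picture why a large window may blur small details.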

yuhua666 commented 7 months ago

Have you tried other sizes such as 3x3 or 5x5? If these sizes are not suitable, why?

hunto commented 7 months ago

> Have you tried other sizes such as 3x3 or 5x5? If these sizes are not suitable, why?

With 224x224 input, the block resolution of Vim is 14x14, and the block resolutions of VMamba are 56x56, 28x28, 14x14, and 7x7.

Using other sizes at these resolutions requires padding, which may slightly increase the computation cost. That is why we did not try other sizes. However, our code supports other sizes: it automatically pads the tensor if the height or width is not evenly divisible by the window size. So you can try other sizes if you are interested.
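
As a rough way to estimate the padding cost before trying a new window size, the sketch below computes the padded resolution and the relative increase in token count for the resolutions mentioned above. The helper names `pad_amount` and `padded_resolution` are hypothetical and not taken from the repository; the sketch only assumes that each side is padded up to the next multiple of the window size.

```python
def pad_amount(size, window):
    """Tokens to add so that `size` becomes a multiple of `window`."""
    return (-size) % window


def padded_resolution(height, width, window):
    """Resolution after padding both sides up to a multiple of `window`."""
    return height + pad_amount(height, window), width + pad_amount(width, window)


# Block resolutions quoted in the thread for 224x224 input.
resolutions = [56, 28, 14, 7]

for window in (2, 3, 5, 7):
    for res in resolutions:
        padded_h, padded_w = padded_resolution(res, res, window)
        # Relative increase in the number of tokens caused by padding.
        overhead = padded_h * padded_w / (res * res) - 1
        print(f"window {window}, {res}x{res} -> {padded_h}x{padded_w} "
              f"(+{overhead:.1%} tokens)")
```

For example, a 3x3 window on a 14x14 grid pads to 15x15, while a 7x7 window needs no padding at any of the listed resolutions.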