hunto / LocalMamba

Code for paper LocalMamba: Visual State Space Model with Windowed Selective Scan
Apache License 2.0

Confusion about the global and the local scan? #10

Closed gfdsah236 closed 6 months ago

gfdsah236 commented 6 months ago

In Figure 1 of the manuscript, why does scan (a) provide a global representation while scan (c) provides a local one? As I understand it, both the (a) and (c) scan sequences are fed to the SSM module, and capturing global correlations is a characteristic of the SSM module itself.

Missyfirst commented 6 months ago

I have the same question as you, @gfdsah236.

hunto commented 6 months ago

Hi @gfdsah236 @Missyfirst ,

As discussed in our paper and the VMamba paper, the scan order of tokens matters because Mamba scans causally. As with recurrent neural networks and convolutions, this causal scanning (causal_conv1d) is also influenced by the distance between timesteps and by the window size. Consequently, a tighter organization of semantically related local tokens yields better representations, even though the SSM can be considered a global operation overall.
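To make the point about token distance concrete, here is a minimal sketch (not code from this repository) that compares how far apart the tokens of one small window end up in the flattened sequence under a plain row-by-row scan versus a windowed scan. The grid size, window size, and function names are assumptions chosen for illustration only.

```python
def row_major_order(h, w):
    """Flattened scan order: row by row over the whole h x w grid."""
    return [(r, c) for r in range(h) for c in range(w)]


def windowed_order(h, w, win):
    """Scan each win x win window row by row, visiting windows row by row.

    Assumes h and w are divisible by win.
    """
    order = []
    for wr in range(0, h, win):
        for wc in range(0, w, win):
            for r in range(wr, wr + win):
                for c in range(wc, wc + win):
                    order.append((r, c))
    return order


def max_gap_within_window(order, win):
    """Largest sequence distance between two tokens of the top-left window."""
    pos = {rc: i for i, rc in enumerate(order)}
    idx = [pos[(r, c)] for r in range(win) for c in range(win)]
    return max(idx) - min(idx)


H = W = 8   # hypothetical token grid
WIN = 2     # hypothetical window size

global_gap = max_gap_within_window(row_major_order(H, W), WIN)
local_gap = max_gap_within_window(windowed_order(H, W, WIN), WIN)
print(global_gap, local_gap)  # prints: 9 3
```

In this toy setting, the four tokens of a 2x2 neighborhood span 9 steps of the sequence under the row-by-row scan but only 3 steps under the windowed scan, which is the sense in which the windowed order keeps local tokens closer together.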

The PlainMamba paper gives another explanation of how discontinuity between adjacent tokens affects performance: "Moreover, as the parameter A in Equation 3 serves as a decaying term, such spatial discontinuity can also cause adjacent tokens to be decayed to different degrees, compounding the semantic discontinuity and resulting in potential performance drop."
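The decay argument can be illustrated with a heavily simplified recurrence. The sketch below assumes a scalar state with a fixed decay factor, h_t = a * h_{t-1} + x_t. In Mamba the decay is input-dependent and the state is multi-dimensional, so this is only a toy model; the value of `A_DECAY` and the step counts are hypothetical.

```python
def impulse_response(a, steps):
    """States of h_t = a * h_{t-1} + x_t for x_0 = 1 and x_t = 0 afterwards.

    The value at index t shows how much of token 0 is still present in the
    state after t scan steps.
    """
    h = 0.0
    out = []
    for t in range(steps):
        x = 1.0 if t == 0 else 0.0
        h = a * h + x
        out.append(h)
    return out


A_DECAY = 0.9  # hypothetical fixed decay factor, 0 < a < 1
resp = impulse_response(A_DECAY, 16)

# On an 8-wide grid scanned row by row, the token directly below token 0
# arrives 8 steps later; in a 2x2 windowed scan it arrives 2 steps later.
below_row_major = resp[8]
below_windowed = resp[2]
print(f"{below_windowed:.2f} vs {below_row_major:.2f}")
```

Under these assumptions, a spatially adjacent token sees roughly 0.81 of the earlier token's contribution in the windowed order but only about 0.43 in the row-by-row order, so spatial neighbors are decayed to quite different degrees depending on the scan.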

gfdsah236 commented 6 months ago

Thanks a lot, I've figured it out.