hustvl / Vim

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Apache License 2.0

Flipped Residual Connection #71

Closed AliYoussef97 closed 1 month ago

AliYoussef97 commented 2 months ago

Hello,

Thank you for your amazing work!

From the paper, the last line of the Vim algorithm is as follows: $T_l : (\mathtt{B}, \mathtt{M}, \mathtt{D}) \leftarrow \mathbf{Linear}^{\mathbf{T}}(y_{forward} + y_{backward}) + T_{l-1}$

From the code, the backward process is the same as the forward, just with the input sequence flipped. However, only the input token sequence (residual) is added to the forward+backward output. Should the flipped token sequence also be added as a residual, such as:

$T_l : (\mathtt{B}, \mathtt{M}, \mathtt{D}) \leftarrow \mathbf{Linear}^{\mathbf{T}}(y_{forward} + y_{backward}) + T_{l-1} + T_{l-1,\,flipped}$
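To make the question concrete, here is a minimal PyTorch-style sketch (not the official code) of the two residual variants. `forward_ssm`, `backward_ssm`, and `linear_T` are placeholder callables standing in for the two SSM passes and the output projection $\mathbf{Linear}^{\mathbf{T}}$, and the sketch glosses over whether $y_{backward}$ is flipped back before the summation.

```python
import torch

def vim_block_paper(T_prev: torch.Tensor, forward_ssm, backward_ssm, linear_T):
    # Residual as written in the paper's algorithm: only T_{l-1} is added back.
    y_forward = forward_ssm(T_prev)                    # scan over tokens 1..M
    y_backward = backward_ssm(T_prev.flip(dims=[1]))   # scan over tokens M..1
    return linear_T(y_forward + y_backward) + T_prev

def vim_block_flipped_residual(T_prev: torch.Tensor, forward_ssm, backward_ssm, linear_T):
    # Variant asked about in this issue: the flipped token sequence is added as well.
    y_forward = forward_ssm(T_prev)
    y_backward = backward_ssm(T_prev.flip(dims=[1]))
    return linear_T(y_forward + y_backward) + T_prev + T_prev.flip(dims=[1])
```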


Edit:

I also noticed that each "v2" Mamba block contains out_a and out_b, i.e. both a forward and a backward pass. However, in the for loop here, we process two Mamba blocks at the same time, each with its own out_a and out_b, but the input for the second Mamba block is flipped, which is quite confusing. Does that mean the flipped input for the second Mamba block is not related to the Mamba block itself and is more of a training mechanism? In other words, if the for loop processed one layer at a time, wouldn't a single Mamba block already do a forward and backward SSM pass?
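For reference, a rough sketch of the loop pattern described above (placeholder names, not the actual repo code): two consecutive blocks are consumed per iteration, and the second one sees the flipped token sequence.

```python
import torch

def bidirectional_ablation_loop(layers, hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: (B, M, D); layers are consumed in pairs
    for i in range(len(layers) // 2):
        block_fwd = layers[2 * i]        # sees the original token order
        block_bwd = layers[2 * i + 1]    # sees the flipped token order

        out_fwd = block_fwd(hidden_states)
        out_bwd = block_bwd(hidden_states.flip(dims=[1]))

        # flip the backward output back and combine; the real code also
        # carries a separate residual stream alongside hidden_states
        hidden_states = out_fwd + out_bwd.flip(dims=[1])
    return hidden_states
```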

Thank you!

jsrdcht commented 1 month ago

> I also noticed that each "v2" Mamba block contains out_a and out_b, i.e. both a forward and a backward pass. However, in the for loop here, we process two Mamba blocks at the same time, each with its own out_a and out_b, but the input for the second Mamba block is flipped, which is quite confusing. Does that mean the flipped input for the second Mamba block is not related to the Mamba block itself and is more of a training mechanism? In other words, if the for loop processed one layer at a time, wouldn't a single Mamba block already do a forward and backward SSM pass?

I found the same problem; ping me if you find anything else.

AliYoussef97 commented 1 month ago

@jsrdcht if_bidirectional is set to False and flip_img_sequences_ratio is set to -1, so the input is processed normally, with the forward and backward passes handled inside the "v2" block here. Hope that helps!
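To spell that out, here is an illustrative control-flow sketch. The function name and the layer call signature are assumptions for illustration; the flag names follow the discussion above, but this is not the exact repo code.

```python
import random
import torch

def vim_forward_sketch(layers, hidden_states: torch.Tensor, residual,
                       if_bidirectional: bool = False,
                       flip_img_sequences_ratio: float = -1.0):
    # flip_img_sequences_ratio = -1 means the random sequence flip never triggers
    if flip_img_sequences_ratio > 0 and random.random() < flip_img_sequences_ratio:
        hidden_states = hidden_states.flip(dims=[1])

    if not if_bidirectional:
        # default path: each layer runs once on the unflipped sequence; the
        # bidirectional scan happens inside the layer's "v2" mixer itself
        for layer in layers:
            hidden_states, residual = layer(hidden_states, residual)
    else:
        # ablation path: layers are consumed in pairs, with the second of each
        # pair processing the flipped sequence (see the loop sketched earlier)
        raise NotImplementedError
    return hidden_states, residual
```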

jsrdcht commented 1 month ago

> @jsrdcht if_bidirectional is set to False and flip_img_sequences_ratio is set to -1, so the input is processed normally, with the forward and backward passes handled inside the "v2" block here. Hope that helps!

Thanks a lot! I think you are right. However, there is still one question about the depth of the model. Why is the default depth 24 for the small/tiny models if it is not there for bidirectionality? For comparison, ViT-Small sets it to 12.

AliYoussef97 commented 1 month ago

@jsrdcht The model itself is bidirectional, as can be seen here. The if_bidirectional parameter being False just ensures that we do not use this loop. So with if_bidirectional being False and flip_img_sequences_ratio set to -1, the input is fed here directly, which goes to v2 (the first link in this comment). The small and tiny variants do not differ in depth; they differ in the hidden state dimension (Section 3.4, Architecture Details, in the paper).
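For intuition, here is a minimal sketch of how a bidirectional ("bimamba v2" style) mixer can produce out_a and out_b inside a single block. This is an assumed structure for illustration, not the mamba_ssm implementation; `scan_fwd` and `scan_bwd` stand in for the two selective-scan branches.

```python
import torch
import torch.nn as nn

class BiMambaMixerSketch(nn.Module):
    """Illustrative only: one block scans the token sequence in both directions."""

    def __init__(self, dim: int, scan_fwd, scan_bwd):
        super().__init__()
        self.scan_fwd = scan_fwd            # placeholder for the forward selective scan
        self.scan_bwd = scan_bwd            # placeholder for the backward selective scan
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, M, D)
        out_a = self.scan_fwd(x)                           # forward direction
        out_b = self.scan_bwd(x.flip(dims=[1]))            # backward direction
        # flip the backward result back to the original order before summing
        return self.out_proj(out_a + out_b.flip(dims=[1]))
```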

jsrdcht commented 1 month ago

@AliYoussef97 Yeah, I agree with you.

> The small and tiny variants do not differ in depth; they differ in the hidden state dimension (Section 3.4, Architecture Details, in the paper).

That's not my point. Based on what you just mentioned, each Vision Mamba block implements both the forward and backward modules internally, so can we simply compare each Mamba block to a Transformer block?

In the configuration of ViT-Small, the depth is 12 (possibly even lower), while for Vision Mamba it is set to 24. Assuming if_bidirectional were True, 24 could be interpreted as pairs of blocks, one per direction. How should we understand this default value of 24 if if_bidirectional is False?

See another repo, where the depth is set to 12.

AliYoussef97 commented 1 month ago

@jsrdcht The linked repo is not the official implementation, so I am not quite sure why its depth is 12. By default, the paper states that the difference between the variants is in the hidden state dimension. if_bidirectional is related to the ablation study in the paper and should be False from my understanding.
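For reference, a hedged summary of the configurations discussed above (values as I read them from the paper's Architecture Details section; treat them as approximate rather than as the official model definitions):

```python
# Vim-Tiny and Vim-Small share the same depth; only the hidden dimension grows.
vim_tiny  = dict(depth=24, embed_dim=192)
vim_small = dict(depth=24, embed_dim=384)
# ViT-Small, for comparison, stacks 12 Transformer blocks at dim 384.
vit_small = dict(depth=12, embed_dim=384)
```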