Closed AliYoussef97 closed 1 month ago
I also noticed that each "v2" Mamba block contains out_a and out_b, i.e. both forward and backward. However, in the for loop here, we process two Mamba blocks at the same time, each with its own out_a and out_b, but the input for the second Mamba block is flipped, which is quite confusing. Does that mean the flipped input for the second Mamba block is not related to the Mamba block itself and is more of a training mechanism? Meaning, if the for loop processes one layer at a time, wouldn't a Mamba block do both a forward and a backward SSM pass?
I found the same problem, ping me if you find something else.
@jsrdcht if_bidirectional is set to False and flip_img_sequences_ratio is set to -1, thus the input is processed normally as forward and backward "v2" here. Hope that helps!
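To illustrate what I mean by forward and backward happening inside a single "v2" block, here is a rough sketch (module names are placeholders for the block's internals, not the actual mamba_ssm code):

```python
import torch
import torch.nn as nn

def v2_block_sketch(x, ssm_fwd, ssm_bwd, out_proj):
    # x: (B, L, D) token sequence entering a single "v2" block.
    # ssm_fwd / ssm_bwd / out_proj are placeholder modules; this mirrors my
    # understanding of the block, not the real implementation.
    out_a = ssm_fwd(x)                    # forward scan over the original token order
    out_b = ssm_bwd(x.flip(dims=[1]))     # backward scan: same block, flipped tokens
    y = out_a + out_b.flip(dims=[1])      # flip the backward output back, then combine
    return out_proj(y)

# toy usage with identity stand-ins for the real modules
x = torch.randn(2, 197, 384)              # (B, L, D)
y = v2_block_sketch(x, nn.Identity(), nn.Identity(), nn.Identity())
```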
Thanks a lot! I think you are right. However, there is still one issue about the depth of the model. Why is the default depth 24 for the small/tiny model if it's not for bidirection? I mean, considering that ViT/small sets it to 12.
@jsrdcht The model itself is bidirectional, and can be found here. The if_bidirectional parameter being False just ensures that we do not use this loop. So with if_bidirectional being False and flip_img_sequences_ratio as -1, the input is fed here directly, which goes to v2 (the first link in this comment). The small and tiny do not differ in depth, but differ in the hidden state dimension (3.4. Architecture Details in the paper).
@AliYoussef97 Yeah, I agree with you.
The small and tiny do not differ in depth, but differ in the hidden state dimension (3.4. Architecture Details in the paper).
That's not my point. Based on what you just mentioned, each Vision Mamba block implements both forward and backward modules internally, so can we simply compare each Mamba block to a transformer block? In the configuration of vision transformer small, the depth is 12 (possibly even lower), while for Vision Mamba it is set to 24. Assuming if_bidirectional is True, 24 can be interpreted as having two types of blocks. How should we understand this default parameter of 24 if if_bidirectional is False?
See another repo; their depth is set to 12.
@jsrdcht The linked repo is not the official implementation, thus I am not quite sure why the depth is 12, but by default, the paper states the difference is in the hidden state dimension. if_bidirectional is related to the ablation study in the paper, and should be False from my understanding.
Hello,
Thank you for your amazing work!
From the paper, the last line in the Vim Algorithm is as follows: $T_l : (\texttt{B}, \texttt{M}, \texttt{D}) \leftarrow \mathrm{Linear}^{\mathbf{T}}(y_{\mathrm{forward}} + y_{\mathrm{backward}}) + T_{l-1}$
From the code, the backward process is the same as the forward; only the input sequence is flipped. However, the input token sequence (residual) is added to the forward+backward output. Should the flipped token sequence be added as a residual as well, such as:
$T_l : (\texttt{B}, \texttt{M}, \texttt{D}) \leftarrow \mathrm{Linear}^{\mathbf{T}}(y_{\mathrm{forward}} + y_{\mathrm{backward}}) + T_{l-1} + T_{l-1,\ \mathrm{flipped}}$
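In code terms, this is the comparison I have in mind (a minimal sketch; linear_T, T_prev and the y tensors are placeholder names, not the repo's):

```python
import torch

def block_output_paper(y_forward, y_backward, T_prev, linear_T):
    # Algorithm 1's last line as I read it from the code:
    # the residual uses only the original token order.
    return linear_T(y_forward + y_backward) + T_prev

def block_output_with_flipped_residual(y_forward, y_backward, T_prev, linear_T):
    # What I am asking about: additionally add the flipped token
    # sequence as a residual.
    return linear_T(y_forward + y_backward) + T_prev + T_prev.flip(dims=[1])
```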
Edit:
I also noticed that each "v2" Mamba block contains out_a and out_b, i.e. both forward and backward. However, in the for loop here, we process two Mamba blocks at the same time, each with its own out_a and out_b, but the input for the second Mamba block is flipped, which is quite confusing. Does that mean the flipped input for the second Mamba block is not related to the Mamba block itself and is more of a training mechanism? Meaning, if the for loop processes one layer at a time, wouldn't a Mamba block do both a forward and a backward SSM pass?
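Put differently, the part that confuses me looks roughly like this (placeholder names, not the repo's), which is why I am unsure whether the flip for the second block is an architectural choice or a training mechanism:

```python
import torch
import torch.nn as nn

# identity stand-ins: each "v2" block already computes out_a (forward) and
# out_b (backward) internally, yet in the paired loop the second block is
# additionally fed the flipped sequence.
v2_block_a, v2_block_b = nn.Identity(), nn.Identity()
hidden_states = torch.randn(2, 197, 384)              # (B, L, D)

out_1 = v2_block_a(hidden_states)                     # block 2*i  : both directions inside
out_2 = v2_block_b(hidden_states.flip(dims=[1]))      # block 2*i+1: same, but flipped input
hidden_states = out_1 + out_2.flip(dims=[1])          # so the flip seems applied twice?
```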
Thank you!