MstarLioning opened 3 months ago
$B$ in Mamba is analogous to $K$ in attention (not $W_k$)
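Concretely, a minimal shape sketch of this analogy (the names `W_k` and `W_B` here are illustrative placeholders, not identifiers from the actual Mamba code): the static projection weight is the counterpart of $W_k$, while the input-dependent tensor it produces is the counterpart of $K$.

```python
import torch

batch, seq_len, d_model, d_state = 2, 16, 64, 8
x = torch.randn(batch, seq_len, d_model)   # a batch of input token embeddings

# Attention: W_k is a fixed weight; K = x @ W_k is a function of the input
W_k = torch.randn(d_model, d_model)        # static parameter, shape (d_model, d_model)
K = x @ W_k                                # input-dependent, shape (batch, seq_len, d_model)

# Selective SSM: W_B plays the role of W_k, and B(x) plays the role of K
W_B = torch.randn(d_model, d_state)        # static projection weight
B = x @ W_B                                # input-dependent, shape (batch, seq_len, d_state)

print(K.shape)   # torch.Size([2, 16, 64])
print(B.shape)   # torch.Size([2, 16, 8])
```

So the shape with batch and sequence dimensions is not the weight itself; it is the result of applying a weight to the input, which is exactly what makes it content-aware.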
@MstarLioning
Well, my take on S4 is that you enforce structure on your parameters in the SSM. Then you perform a convolution that is implicitly parametrized by the SSM. The downside of this approach is that your parameters are not functions of your input. In Mamba you actually compute projections of the input to get the parameters $\Delta$, $B$, and $C$ (while $A$ stays input-independent), and with Mamba-2 you do the same but the model is a bit more constrained. See the sketch below for the contrast.
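Here is a minimal sketch of that contrast, with assumed shapes and hypothetical names (`proj_B`, `proj_C`, `proj_dt`); it is not the reference implementation. In an S4-style layer the SSM parameters are fixed tensors shared by every token, whereas in a Mamba-style (selective) layer $B$, $C$, and the step size $\Delta$ are produced per token by linear projections of the input.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, d_state = 2, 16, 64, 8
x = torch.randn(batch, seq_len, d_model)

# S4-style: B and C are learned parameters, identical for every token and every input
B_s4 = nn.Parameter(torch.randn(d_model, d_state))   # input-independent, shape (d_model, d_state)
C_s4 = nn.Parameter(torch.randn(d_model, d_state))

# Mamba-style (selective): B, C and the step size Δ are projections of the input itself
proj_B  = nn.Linear(d_model, d_state, bias=False)
proj_C  = nn.Linear(d_model, d_state, bias=False)
proj_dt = nn.Linear(d_model, d_model, bias=True)

B_mamba = proj_B(x)                                   # (batch, seq_len, d_state) — varies per token
C_mamba = proj_C(x)                                   # (batch, seq_len, d_state)
dt      = torch.nn.functional.softplus(proj_dt(x))    # (batch, seq_len, d_model), positive step size

print(B_s4.shape, B_mamba.shape)   # torch.Size([64, 8]) torch.Size([2, 16, 8])
```

Because $B$, $C$, and $\Delta$ now depend on the current token, the recurrence can no longer be rewritten as a single fixed convolution, which is why Mamba uses a scan instead.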
Hello. I am currently reading Mamba-1 and there is one point I don't quite understand. In the comparison with the S4 paper, it is mentioned that, in order to make Mamba dependent on the input, the matrix $B$ changes from shape (hidden state size $\times$ size of input vector) to (batch size $\times$ hidden state size $\times$ sequence length). However, aren't $W_q$, $W_k$, and $W_v$ in Transformers also of shape (hidden state size $\times$ size of input vector)? So why does incorporating the sequence length and batch size resolve the content-awareness issue? I hope to receive your reply and am deeply grateful!