state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

some confusion about mamba block #281

Open CalebDu opened 7 months ago

CalebDu commented 7 months ago

I browsed the Mamba block code and have some questions:

https://github.com/state-spaces/mamba/blob/12d855003ba92c8a15d1739ce65a14c6fb16e254/mamba_ssm/modules/mamba_simple.py#L92C5-L99C44

https://github.com/state-spaces/mamba/blob/12d855003ba92c8a15d1739ce65a14c6fb16e254/mamba_ssm/modules/mamba_simple.py#L240

1. In lines 92-99, the code copies $\mathrm{softplus}^{-1}(dt)$ into `dt_proj.bias`, where $dt$ is sampled from $[dt_{\min}, dt_{\max}]$ (log-uniformly, as $\exp(\mathrm{rand} \cdot (\log dt_{\max} - \log dt_{\min}) + \log dt_{\min})$). Then in line 240, $dt = \mathrm{softplus}(dt + \mathrm{dt\_proj.bias})$. By the identity $\mathrm{softplus}(\mathrm{softplus}^{-1}(dt)) = dt$, the bias just reproduces the sampled $dt$ at initialization, which does not look consistent with "Algorithm 2 SSM + Selection (S6)" in the paper. Why pass the bias through $\mathrm{softplus}^{-1}$? And why sample over $[dt_{\min}, dt_{\max}]$ via $\exp(\log(\mathrm{rand} \cdot \ldots))$, is that for numeric stability? (A sketch of this init follows below.)

2. https://github.com/state-spaces/mamba/blob/12d855003ba92c8a15d1739ce65a14c6fb16e254/mamba_ssm/modules/mamba_simple.py#L242C12-L242C51 In the paper, $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$, but in line 242, $\bar{B} = \Delta B$.
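
For concreteness, here is a minimal sketch of what the initialization at lines 92-99 computes (paraphrased from the linked code; `d_inner`, `dt_min`, `dt_max` follow the module's argument names, and the real code additionally clamps `dt` at a small floor):

```python
import math
import torch
import torch.nn.functional as F

d_inner, dt_min, dt_max = 16, 0.001, 0.1  # module defaults

# Sample dt log-uniformly in [dt_min, dt_max]:
# exp(U * (log dt_max - log dt_min) + log dt_min)
dt = torch.exp(
    torch.rand(d_inner) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
)

# Softplus inverse: softplus(x) = log(1 + e^x), so softplus^{-1}(y) = y + log(1 - e^{-y})
inv_dt = dt + torch.log(-torch.expm1(-dt))  # this is what gets copied into dt_proj.bias

# Round trip: at init (projection output near zero), softplus(dt_proj.bias) recovers dt
assert torch.allclose(F.softplus(inv_dt), dt, atol=1e-5)
```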

radarFudan commented 7 months ago

I believe this softplus parameterization ensures that, whatever value the unconstrained bias takes, the variable dt is guaranteed to stay strictly positive. Training is then more stable.

A potentially relevant reference is StableSSM (https://arxiv.org/abs/2311.14495). It shows theoretically that exponential and softplus reparameterizations help the learning of long-term memory and training stability.
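
To make the stability point concrete, a tiny check (my own illustration, not from the repo): softplus maps any real-valued, unconstrained bias to a strictly positive step size, so no gradient update can push $\Delta$ negative:

```python
import torch
import torch.nn.functional as F

# The raw bias is unconstrained and can drift anywhere during training...
bias = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])

# ...but softplus(bias) = log(1 + exp(bias)) is always strictly positive
delta = F.softplus(bias)
print(delta)  # tensor([4.5418e-05, 3.1326e-01, 6.9315e-01, 1.3133e+00, 1.0000e+01])
assert (delta > 0).all()
```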

CalebDu commented 7 months ago

I revisited the paper carefully: Section 3.6 mentions that the parameter $\Delta$ is initialized as $\tau_{\Delta}^{-1}(\mathrm{Uniform}([0.001, 0.1]))$. It was my oversight.
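
Connecting the paper to the code (my reading, with $\tau_\Delta = \mathrm{softplus}$ as stated in the paper): the initialization at lines 92-99 is exactly this recipe, up to sampling log-uniformly rather than uniformly:

$$dt \sim \mathrm{Uniform}([dt_{\min}, dt_{\max}]), \qquad \texttt{dt\_proj.bias} = \tau_\Delta^{-1}(dt) = \mathrm{softplus}^{-1}(dt) = dt + \log\!\left(1 - e^{-dt}\right),$$

so that at initialization $\Delta = \tau_\Delta(\texttt{dt\_proj.bias}) = dt$ lands back in the intended range.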

sdreamforchen commented 7 months ago

What about question 2? I have the same confusion.

gyu-heo commented 7 months ago

Question 2 has been answered in issue #114 (closed). Long story short, the code approximates the paper's exact (ZOH) discretization of $\bar{B}$ with the computationally simpler first-order step $\bar{B} = \Delta B$.
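
For anyone landing here later, a scalar sketch (my own, not from #114) of why this is a reasonable approximation: expanding $\exp(\Delta A)$ to first order gives $(\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B \approx \Delta B$, and the two agree closely for the small $\Delta$ values used at initialization:

```python
import math

# Scalar toy case: stable A < 0, with dt in the init range [0.001, 0.1]
A, B = -1.0, 1.0
for dt in [0.001, 0.01, 0.1]:
    dA = dt * A
    B_zoh = (math.exp(dA) - 1.0) / dA * dt * B  # paper: (dA)^{-1} (exp(dA) - I) dB
    B_euler = dt * B                            # code (line 242): dB
    print(f"dt={dt}: ZOH={B_zoh:.6f}, Euler={B_euler:.6f}, "
          f"rel. diff={abs(B_zoh - B_euler) / B_zoh:.2%}")
```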