liang00fan opened 2 weeks ago
For question 1: in my opinion, $\Delta$ is the discretization step size, so to make it input-dependent we need the $s_\Delta(x_t)$ term. But the paper parameterizes it as $\Delta = \mathrm{softplus}(\mathrm{Parameter} + s_\Delta(x_t))$, so why does the code drop the "Parameter +" part? It does not seem to match the paper:
dt = self.dt_proj.weight @ dt.t()
Parameter here is the dt_bias. Linear(x_t) means self.dt_proj.weight @ dt.t() + dt_bias. In the code we separately do self.dt_proj.weight @ dt.t(), and then the dt_bias is added in a separate step (in the CUDA kernel).
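To make that concrete, here is a minimal sketch (toy shapes, not the repo's actual code) showing that nn.Linear is just a matmul plus a bias, so doing the matmul up front and adding dt_bias later gives the same result:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dt_rank, d_inner, n_tokens = 4, 8, 3          # toy sizes, for illustration only

dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
dt = torch.randn(n_tokens, dt_rank)           # stand-in for the projected input

# One step: Linear(x_t) = dt @ W.T + dt_bias
full = dt_proj(dt)

# Two steps: matmul first, bias deferred (as a fused kernel could do)
partial = dt_proj.weight @ dt.t()             # shape (d_inner, n_tokens), no bias yet
deferred = partial + dt_proj.bias[:, None]    # dt_bias added in a later step

print(torch.allclose(full, deferred.t(), atol=1e-6))  # True
```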
Hello @tridao @albertfgu, I am very interested in this point as well. Could you please elaborate a bit more on how adding this learnable dt_bias to the input-dependent $s_\Delta(x_t)$ ensures that $\Delta$ is at the right magnitude? May I ask which previous SSM paper discussed this? Much appreciated!
The S4 code has a parameter dt that should be in the range of 1e-3 to 1e-1 (these are hyperparameters that you can change). In Mamba's case we want softplus(x @ weight + dt_bias) to be around that range. We can assume that x @ weight has zero mean at initialization, so we initialize dt_bias so that softplus(dt_bias) is in the range 1e-3 to 1e-1. Of course this only holds at initialization; it does not guarantee dt stays in that range as the model trains. https://github.com/state-spaces/mamba/blob/28b1435eb56c3082a243d23253ee7676ad737c09/mamba_ssm/modules/mamba_simple.py#L91
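For reference, a sketch of that initialization idea (the dt_min/dt_max values and shapes here are assumptions; see the linked file for the actual code): sample a target step size per channel and set dt_bias to its inverse softplus, so that softplus(dt_bias) starts out in the desired range.

```python
import math
import torch
import torch.nn.functional as F

d_inner, dt_min, dt_max = 16, 1e-3, 1e-1      # assumed values, for illustration

# Log-uniform sample of the desired step size for each channel
dt = torch.exp(
    torch.rand(d_inner) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
)

# Inverse softplus: softplus(y) = log(1 + e^y), so y = log(e^dt - 1)
dt_bias = dt + torch.log(-torch.expm1(-dt))

print(torch.allclose(F.softplus(dt_bias), dt, atol=1e-6))  # True: starts in [1e-3, 1e-1]
```

Since x @ weight is roughly zero-mean at initialization, softplus(x @ weight + dt_bias) is approximately softplus(dt_bias), i.e. the sampled dt.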
Question 1: why do we need to add "Parameter +"?
Question 2: why does it use self.dt_proj.weight @ dt.t()?
Question 3: those lines become delta = F.softplus(self.dt_proj(delta)); is that the same?
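Regarding question 3, here is a quick numerical check one could run (toy shapes, assumed here): F.softplus(self.dt_proj(delta)) computes softplus(delta @ W.T + dt_bias) in one go, which should match the split weight @ delta.t() path once the bias is added back, up to a transposed layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dt_rank, d_inner, n_tokens = 4, 8, 5          # toy sizes

dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
delta = torch.randn(n_tokens, dt_rank)

# Fused one-liner: softplus(delta @ W.T + dt_bias)
fused = F.softplus(dt_proj(delta))

# Split path: matmul first, dt_bias and softplus applied afterwards
split = F.softplus(dt_proj.weight @ delta.t() + dt_proj.bias[:, None]).t()

print(torch.allclose(fused, split, atol=1e-6))  # True: same values, transposed layout
```

Note the one-liner is only equivalent if dt_proj carries the bias (bias=True); with bias=False it would drop the "Parameter" term the paper includes.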