Aristo23333 opened this issue 2 months ago
I also have the same question about the B and C matrices. Do you have any new ideas to share?
Yes, we tied $B$ and $C$ across all the channels. In Mamba-2 we called this "multi-value attention" or "multi-expand SSM" head structure.
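For intuition, here is a minimal shape-level sketch of what tying $B$ and $C$ across channels means (the sizes and variable names below are made up for illustration, not taken from the repo): every channel shares the same $B_t$ and $C_t$ at each timestep, which is the "multi-value" head structure mentioned above.

```python
import torch

# Made-up sizes for illustration only.
batch, d_inner, d_state = 2, 4, 16

B_t = torch.randn(batch, d_state)           # B at a single timestep, shared by all channels
C_t = torch.randn(batch, d_state)           # C at a single timestep, shared by all channels

x_t = torch.randn(batch, d_inner)           # input at that timestep, one value per channel
h_t = torch.randn(batch, d_inner, d_state)  # per-channel SSM state

dh  = torch.einsum('bn,bd->bdn', B_t, x_t)  # the shared B writes every channel's input
y_t = torch.einsum('bn,bdn->bd', C_t, h_t)  # the shared C reads out every channel's state
```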
Thank you for your reply! I have another question: I find that A can change its values during training while maintaining a diagonal structure. Have you ever explored whether changes in the distribution of these diagonal elements influence performance? Maybe in S4? What concerns me most is the distribution of the values, not only the structure. Thank you!
While not about the distribution of trained weights, I would highly recommend this paper on the parameterization of SSMs, https://arxiv.org/pdf/2206.11893, which is a successor to this paper: https://arxiv.org/pdf/2008.07669.
However, I will say that in the appendix of Mamba they ablate the parameterization of selective SSMs and note that a random initialization is completely fine.
Another interesting thing that is not really mentioned is that they ensure that A is always negative. To see this, look at this line. This has to do with ensuring that the recurrent dynamics decay inputs through time.
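For reference, a small sketch of that negativity trick as I read the code (variable names here are mine; check the Mamba module for the exact line): the model stores $\log A$ and negates its exponential, so $\exp(\Delta A)$ always lies in $(0, 1)$ and the hidden state decays.

```python
import torch
import torch.nn as nn

d_inner, d_state = 4, 16

# S4D-real-style initialization: A_init[d, n] = n + 1, stored as log(A).
A_init = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(d_inner, 1)
A_log = nn.Parameter(torch.log(A_init))

# Negating the exponential makes A strictly negative for any value of A_log.
A = -torch.exp(A_log.float())

delta = torch.rand(d_inner, 1)   # step sizes are positive (softplus in the real model)
A_bar = torch.exp(delta * A)     # every entry lies in (0, 1) -> inputs decay over time
```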
Thank you so much for your information and reference
Hi, when I print the dimensions of A, B, and C, I see that the dimensions of B and C are (batch_size, 1, d_state, seq_length), which makes sense according to the paper. However, the shape of A is (expand_dim * d_model, d_state), which is strange, since A should also be input-dependent after discretization. Can anyone explain why the dimension of A is not (batch_size, expand_dim * d_model, d_state, seq_length)?
You pass it to the function undiscretized. Their hardware-aware algorithm will discretize it in local SRAM on the GPU to save memory and time.
Remember that the discretization process is exp(delta * A) where A is the same at all time steps but delta is different. Look at the code and you will notice that we pass both delta and A, meaning the CUDA function will do it for you.
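To make that concrete, here is a rough sketch (the sizes are assumptions chosen to match the shapes printed above, not actual repo code) of how the time-invariant A and the time-varying delta combine into a per-timestep $\bar{A} = \exp(\Delta A)$; the fused kernel does this on the fly in SRAM instead of materializing the full tensor in Python.

```python
import torch

batch, d_inner, d_state, seq_len = 2, 4, 16, 8
A = -torch.rand(d_inner, d_state)            # time-invariant: the (expand * d_model, d_state) tensor you printed
delta = torch.rand(batch, d_inner, seq_len)  # input-dependent step sizes

# A_bar[b, d, l, n] = exp(delta[b, d, l] * A[d, n])
A_bar = torch.exp(torch.einsum('bdl,dn->bdln', delta, A))
print(A_bar.shape)  # torch.Size([2, 4, 8, 16]): the input-dependent shape you expected for A
```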
Hope this helps!
Thanks! And the discretization of B also happens in the CUDA function, right? I see that if I want to get A and B after discretization using Python, I can use the code from the selective_scan_ref function in selective_scan_interface.py. I used the function's code up to the for loop to create the following function:
```python
import torch
import torch.nn.functional as F
from einops import rearrange, repeat


def get_params_after_discretization(delta, A, B, C, delta_bias=None, delta_softplus=False):
    # The code of this function was taken from 'selective_scan_ref'.
    temp_delta = delta.float()
    if delta_bias is not None:
        temp_delta = temp_delta + delta_bias[..., None].float()
    if delta_softplus:
        temp_delta = F.softplus(temp_delta)
    dim = A.shape[0]
    is_variable_B = B.dim() >= 3
    is_variable_C = C.dim() >= 3
    # Default so temp_B/temp_C are always defined, even for complex A with non-variable B/C.
    temp_B, temp_C = B, C
    if A.is_complex():
        print("DEBUG: get_params_after_discretization: A.is_complex()")
        if is_variable_B:
            temp_B = torch.view_as_complex(rearrange(B.float(), "... (L two) -> ... L two", two=2))
        if is_variable_C:
            temp_C = torch.view_as_complex(rearrange(C.float(), "... (L two) -> ... L two", two=2))
    else:
        temp_B = B.float()
        temp_C = C.float()
    # Discretize A with zero-order hold: A_bar = exp(delta * A).
    delta_A = torch.exp(torch.einsum('bdl,dn->bdln', temp_delta, A))
    print(f"DEBUG: get_params_after_discretization temp_B.dim()={temp_B.dim()} temp_C.dim()={temp_C.dim()}")
    # Discretize B (simplified Euler step): B_bar = delta * B.
    if not is_variable_B:
        delta_B = torch.einsum('bdl,dn->bdln', temp_delta, temp_B)
    else:
        if temp_B.dim() == 3:
            delta_B = torch.einsum('bdl,bnl->bdln', temp_delta, temp_B)
        else:
            temp_B = repeat(temp_B, "B G N L -> B (G H) N L", H=dim // temp_B.shape[1])
            delta_B = torch.einsum('bdl,bdnl->bdln', temp_delta, temp_B)
    # C is not discretized; just expand its group dimension to match the channels.
    if is_variable_C and temp_C.dim() == 4:
        temp_C = repeat(temp_C, "B G N L -> B (G H) N L", H=dim // temp_C.shape[1])
    print(f"DEBUG: get_params_after_discretization: delta_A.shape={delta_A.shape} delta_B.shape={delta_B.shape} temp_C.shape={temp_C.shape}")
    return delta_A, delta_B, temp_C
```
I am calling this function right after the call to selective_scan_cuda.fwd.
Do you think this is the correct way to get the parameters after discretization without entering the CUDA function?
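For what it's worth, a quick dummy-tensor check of the shapes that come back (all sizes below are made up, and `delta_softplus=True` is an assumption meant to mirror the usual selective-scan call):

```python
import torch

# Assumes get_params_after_discretization from the snippet above is in scope.
batch, d_inner, d_state, seq_len = 2, 4, 16, 8
delta = torch.rand(batch, d_inner, seq_len)
A = -torch.rand(d_inner, d_state)             # real-valued, negative A
B = torch.rand(batch, 1, d_state, seq_len)    # one group, tied across channels
C = torch.rand(batch, 1, d_state, seq_len)

delta_A, delta_B, C_out = get_params_after_discretization(delta, A, B, C, delta_softplus=True)
print(delta_A.shape, delta_B.shape, C_out.shape)
# torch.Size([2, 4, 8, 16]) torch.Size([2, 4, 8, 16]) torch.Size([2, 4, 16, 8])
```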
Dear Authors, thanks for your brilliant work! I am now learning about the detailed changes of parameter shapes in your code and in your paper. I noticed that A is (D, N), which represents an N×N matrix with a diagonal structure, right? That seems to mean there are D N×N matrices, one for each of the D channels? The same goes for D. But I am a bit confused about B and C, which are (B, L, N), because they do not seem to contain any information about the different channels. Are they the same for all D channels? Thank you!