Hi,
Thanks for providing the Mamba implementation. I would like to know whether there is a more memory-efficient way to compute deltaA and deltaB_u that avoids the GPU out-of-memory issue. These are the parameters I used to create the Mamba instance:
d_model: 1024
n_layer: 4
d_state: 1024
expand: 2
The other parameters are set to their default values.
This results in a model of ~60M parameters. However, I run out of memory (max GPU memory = 24 GB) when I train with a batch size of 256, or even as low as 64; this probably happens because of the large intermediate tensors computed for deltaA and deltaB_u.
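For a rough sense of scale, assuming deltaA and deltaB_u are each materialized with shape (batch, seq_len, d_inner, d_state) by the einsum-based selective scan, with d_inner = expand * d_model, a back-of-the-envelope estimate looks like this (the sequence length of 1024 below is a hypothetical value, not taken from the issue):

```python
# Rough fp32 memory estimate for one of deltaA / deltaB_u, assuming
# the (batch, seq_len, d_inner, d_state) layout of this implementation.
batch, seq_len = 64, 1024                  # seq_len is hypothetical
d_model, expand, d_state = 1024, 2, 1024
d_inner = expand * d_model                 # 2048

elements = batch * seq_len * d_inner * d_state
gib = elements * 4 / 2**30                 # 4 bytes per fp32 element
print(f"one such tensor: {gib:.0f} GiB")   # ~512 GiB
```

At these settings a single one of these tensors would need on the order of hundreds of GiB, far beyond 24 GB, so the OOM is expected rather than a bug.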
This repo is mostly meant for educational purposes, so I would suggest using the official repo for any training: https://github.com/state-spaces/mamba
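For reference, the official package implements the selective scan as a fused CUDA kernel, and its README shows block-level usage roughly like the following (a sketch adapted from that README; check the repo for the current API):

```python
import torch
from mamba_ssm import Mamba  # pip install mamba-ssm

batch, length, dim = 8, 1024, 1024
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,  # model dimension
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # local convolution width
    expand=2,     # block expansion factor
).to("cuda")

y = model(x)
assert y.shape == x.shape
```

The hardware-aware kernel avoids materializing the full (batch, seq_len, d_inner, d_state) intermediates in GPU memory, which is what makes training at larger batch sizes feasible.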