Open MachRepo opened 8 months ago
The model seems very small, but the GPU also only has 4GB of memory? Maybe try different layers (e.g. MLP) of similar sizes to see if they also OOM. If so, it's not an issue with Mamba.
@tridao I tried an MLP instead and increased the size up to 500, and the model worked. What might be the problem?
Did you check that the model sizes for the MLP vs Mamba match? Are you using the fast path of the Mamba block?
@albertfgu The Mamba model has around 500k params, and the MLP I used for the test had 2 million parameters and still worked perfectly fine. Excuse my ignorance, but how do I know if I'm using the fast path of Mamba?
You can set a breakpoint or print statements inside the module to test if it's using the right path, like here: https://github.com/state-spaces/mamba/blob/009bec5ee37f586844a3fc89c040a9c1a9d8badf/mamba_ssm/modules/mamba_simple.py#L145
If it is, then I don't know what the problem is. It shouldn't be that much less efficient than an MLP. You're using the same batch size and sequence length for both models?
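A rough sketch of why parameter count alone says little about this OOM. It counts the parameters of a single Mamba block, assuming the layer shapes in `mamba_simple.py` (bias-free `in_proj`, `x_proj` and `out_proj`, a depthwise `conv1d` with bias, and `dt_rank = ceil(d_model / 16)`). The helper name `mamba_block_param_count` is hypothetical, not part of `mamba_ssm`.

```python
import math

def mamba_block_param_count(d_model, d_state=16, d_conv=4, expand=2):
    """Parameter count of one Mamba block (assumed layer shapes, see lead-in)."""
    d_inner = expand * d_model
    dt_rank = math.ceil(d_model / 16)            # assumed "auto" default
    in_proj = d_model * 2 * d_inner              # Linear, no bias
    conv1d = d_inner * d_conv + d_inner          # depthwise conv weight + bias
    x_proj = d_inner * (dt_rank + 2 * d_state)   # Linear, no bias
    dt_proj = dt_rank * d_inner + d_inner        # Linear with bias
    a_log = d_inner * d_state                    # state matrix parameter
    d_skip = d_inner                             # skip-connection parameter
    out_proj = d_inner * d_model                 # Linear, no bias
    return in_proj + conv1d + x_proj + dt_proj + a_log + d_skip + out_proj

params = mamba_block_param_count(200)
param_mib = params * 4 / 2**20                   # float32 weights
print(params, round(param_mib, 2))
```

Under these assumptions a `d_model=200` block holds 272,400 parameters, about 1 MiB in float32. With a batch of 12 and a sequence length of 15000, the weights are negligible, so the memory pressure has to come from activations, which scale with batch size × sequence length × `d_inner`.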
@albertfgu Hello again Mr. Gu, thank you very much for your assistance. I added a print statement inside the module, and it was printed when I used the module. I am using a unified training algorithm, so I only change the model's content and launch my training. So yes, sequence length (15000) and batch size (12) are the same. I tried to use Mamba on its own in the shell and got the same error message when choosing d_model greater than 100.
import torch
from mamba_ssm import Mamba
m = Mamba(d_model=100,d_state=16,d_conv=4,expand=2)
x = torch.rand(12, 15000, 100)
x = x.to('cuda')
m = m.to('cuda')
s = m(x)
import torch
from mamba_ssm import Mamba
m = Mamba(d_model=200,d_state=16,d_conv=4,expand=2)
x = torch.rand(10, 15000, 200)
x = x.to('cuda')
m = m.to('cuda')
s = m(x)
import torch
from mamba_ssm import Mamba
m = Mamba(d_model=200,d_state=16,d_conv=4,expand=2)
x = torch.rand(12, 15000, 200)
x = x.to('cuda')
m = m.to('cuda')
s = m(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 137, in forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 550.00 MiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Process 42 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 3.13 GiB is allocated by PyTorch, and 43.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
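The 550 MiB in the traceback can be reproduced by hand. The failing line multiplies `in_proj.weight` by the input rearranged to `d (b l)`, so the result has `2 * d_inner` rows and `batch * seqlen` columns. The sketch below assumes float32 (the shell snippets do not use autocast) and assumes the allocator rounds large requests up to a multiple of 2 MiB, which is my understanding of PyTorch's caching allocator rather than something stated in this thread. `alloc_mib` is a hypothetical helper.

```python
import math

def alloc_mib(shape, bytes_per_elem=4, round_to_mib=2):
    """Return (raw size, size rounded up to round_to_mib) of a dense tensor, in MiB."""
    raw = math.prod(shape) * bytes_per_elem / 2**20
    rounded = math.ceil(raw / round_to_mib) * round_to_mib
    return raw, rounded

batch, seqlen, d_model, expand = 12, 15000, 200, 2
d_inner = expand * d_model

# in_proj maps d_model -> 2 * d_inner for every token in the batch.
raw, rounded = alloc_mib((2 * d_inner, batch * seqlen))
print(f"{raw:.2f} MiB raw, {rounded} MiB after rounding")
```

So this single activation needs roughly 550 MiB, which is a large share of a 4 GiB card that is already running another training job. The same helper applied to a `(9, 1000, 15000)` tensor gives 516 MiB, which matches the `causal_conv1d_fwd` allocation in the `d_model=500` traceback further down.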
PS: I am already training another model on my GPU, which is why there is no memory left. But I don't understand this part of the message:
Process 42 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use.
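A quick check on that figure, as a sketch: 17179869184 is exactly 2^34, so the reported "17179869184.00 GiB" is about 2^64 bytes. That is what an unsigned 64-bit counter shows when it wraps around or holds an invalid sentinel, so the number is most likely a bogus reading and not real memory usage. This interpretation is an assumption; nothing in the thread confirms where the value comes from.

```python
reported_gib = 17179869184           # value printed in the error message
reported_bytes = reported_gib * 2**30

is_power_of_two = reported_gib == 2**34
is_u64_range = reported_bytes == 2**64   # two decimals cannot separate 2**64 from 2**64 - 1
print(is_power_of_two, is_u64_range)
```

The figures worth reading in the message are the PyTorch ones: 3.13 GiB allocated out of 4.00 GiB total, with 0 bytes free.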
The weird thing is that when I execute these lines the first time, they work perfectly fine, but once I execute them again I get the same error message:
>>> import torch
>>> from mamba_ssm import Mamba
>>> m = Mamba(d_model=500,d_state=16,d_conv=4,expand=2)
>>> x = torch.rand(9, 15000, 500)
>>> x = x.to('cuda')
>>> m = m.to('cuda')
>>> s = m(x)
>>> import torch
>>> from mamba_ssm import Mamba
>>> m = Mamba(d_model=500,d_state=16,d_conv=4,expand=2)
>>> x = torch.rand(9, 15000, 500)
>>> x = x.to('cuda')
>>> m = m.to('cuda')
>>> s = m(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 306, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 181, in forward
    conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(x, conv1d_weight, conv1d_bias, None, True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 516.00 MiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Process 42 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 3.35 GiB is allocated by PyTorch, and 18.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
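One possible reason the snippet works once and fails on the second run: in the REPL, the first run's output `s` stays alive until the name is rebound, and rebinding only happens after the second `m(x)` has finished. The second forward pass therefore has to fit next to the first run's result. The sketch below illustrates that lifetime rule with a pure-Python stand-in (`FakeTensor` is hypothetical and only counts bytes). It relies on CPython reference counting and does not model the autograd graph, which in the real case would keep additional saved activations alive together with `s`.

```python
class FakeTensor:
    """Stand-in for a GPU tensor: tracks how many bytes are alive at once."""
    live_bytes = 0

    def __init__(self, nbytes):
        self.nbytes = nbytes
        FakeTensor.live_bytes += nbytes

    def __del__(self):
        FakeTensor.live_bytes -= self.nbytes

def forward(x):
    out = FakeTensor(x.nbytes)          # output has the same size as the input
    return out, FakeTensor.live_bytes   # bytes alive while the call is running

N = 9 * 15000 * 500 * 4                 # float32 input of shape (9, 15000, 500)

x = FakeTensor(N)
s, peak_first = forward(x)              # first run: x and s are alive

x = FakeTensor(N)                       # the old x is freed once the name is rebound
s, peak_second = forward(x)             # the old s is still alive during this call

del s                                   # drop the previous result before re-running
x = FakeTensor(N)
s, peak_after_del = forward(x)

print(peak_first // N, peak_second // N, peak_after_del // N)
```

If this is what happens here, running `del s` before repeating the snippet, or wrapping the call in `with torch.no_grad():` when gradients are not needed, should lower the peak. Whether that is enough on a 4 GiB card that is also running another training job is not something this sketch can show.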
Hello, I am trying to implement a Mamba-based model. Whenever I try to increase d_model above 100, I get this error message. I am using torch.cuda.amp for mixed-precision training.
Here is the model:
And here is the error message: