
Memory issues for d_model above 100 #166

Open MachRepo opened 8 months ago

MachRepo commented 8 months ago

Hello, I am trying to implement a Mamba-based model. Whenever I try to increase d_model above 100, I get the error message below. I am using torch.cuda.amp for mixed-precision training.

Here is the model:

import torch
import torch.nn as nn
from mamba_ssm import Mamba

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        L = 128  # d_model
        self.E = nn.Embedding(5, L, padding_idx=0)
        # Eight stacked Mamba blocks with identical hyperparameters
        self.mamba1 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba2 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba3 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba4 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba5 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba6 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba7 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.mamba8 = Mamba(d_model=L, d_state=16, d_conv=4, expand=2)
        self.classifier = nn.Linear(L, 2)

    def forward(self, x):
        x = self.E(x)             # (B, seq_len) token ids -> (B, seq_len, 128)
        x = self.mamba1(x)
        x = self.mamba2(x)
        x = self.mamba3(x)
        x = self.mamba4(x)
        x = self.mamba5(x)
        x = self.mamba6(x)
        x = self.mamba7(x)
        x = self.mamba8(x)
        out = self.classifier(x)  # per-position logits over 2 classes
        return out

And here is the error message:

Traceback (most recent call last):
  File "/mnt/c/Users/Fadia/OneDrive/Bureau/GCPGPU/PC/Training.py", line 304, in <module>
    history += fit(model, epochs, lr, train_loader, val_loader, mode)
  File "/mnt/c/Users/Fadia/OneDrive/Bureau/GCPGPU/PC/Training.py", line 224, in fit
    scaler.scale(loss).backward()
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 252, in backward
    dconv1d_out, ddelta, dA, dB, dC, dD, ddelta_bias, dz, out_z = selective_scan_cuda.bwd(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 3.34 GiB is allocated by PyTorch, and 54.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
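
Note: the traceback shows the OOM is raised in the backward pass of the selective scan, from scaler.scale(loss).backward() inside fit. The fit function itself is not shown in the issue; the sketch below is a minimal reconstruction of the standard torch.cuda.amp pattern it presumably follows (train_loader, the optimizer, and the loss are assumptions, with shapes matching the batch size 12 and sequence length 15000 mentioned later in the thread):

import torch
import torch.nn.functional as F

# Minimal sketch of a standard torch.cuda.amp training step; the actual
# fit() from Training.py is not shown in the issue, so names are assumed.
model = Model().to('cuda')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

# Dummy single-batch "loader" so the sketch is self-contained.
train_loader = [(torch.randint(0, 5, (12, 15000)), torch.randint(0, 2, (12, 15000)))]

for xb, yb in train_loader:
    xb, yb = xb.to('cuda'), yb.to('cuda')
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        out = model(xb)                                  # (B, seq_len, 2) logits
        loss = F.cross_entropy(out.transpose(1, 2), yb)  # per-token loss (assumed)
    scaler.scale(loss).backward()  # <- the OOM in the traceback is raised here
    scaler.step(optimizer)
    scaler.update()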

tridao commented 8 months ago

The model seems very small, but the GPU also only has 4GB of memory? Maybe try different layers (e.g. an MLP) of similar sizes to see if those also OOM. If so, then it's not an issue with Mamba.
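
Note: a quick way to run that control experiment is to keep the embedding, batch size, and sequence length identical and swap each Mamba block for an MLP of comparable width. A sketch (the layer sizes here are illustrative, not taken from the issue):

import torch.nn as nn

# Control experiment: same embedding and classifier, but each Mamba block
# replaced by a small per-position MLP of roughly comparable width.
class MLPModel(nn.Module):
    def __init__(self, L=128, n_layers=8):
        super().__init__()
        self.E = nn.Embedding(5, L, padding_idx=0)
        self.layers = nn.Sequential(*[
            nn.Sequential(nn.Linear(L, 2 * L), nn.GELU(), nn.Linear(2 * L, L))
            for _ in range(n_layers)
        ])
        self.classifier = nn.Linear(L, 2)

    def forward(self, x):
        x = self.E(x)
        x = self.layers(x)
        return self.classifier(x)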

MachRepo commented 8 months ago

@tridao I tried using an MLP instead and increased the size up to 500, and the model worked. What might be the problem?

albertfgu commented 8 months ago

Did you check that the model sizes for the MLP vs Mamba match? Are you using the fast path of the Mamba block?
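
Note: the parameter counts can be compared directly with a one-liner (mamba_model and mlp_model here stand for the two models under test):

def n_params(model):
    # Total trainable parameter count, for an apples-to-apples size comparison.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(n_params(mamba_model), n_params(mlp_model))  # both models assumed defined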

MachRepo commented 7 months ago

@albertfgu The Mamba model is around 500k parameters, and the MLP I used for the test had 2 million parameters and still worked perfectly fine. Excuse my ignorance, but how do I know if I'm using the fast path of Mamba?

albertfgu commented 7 months ago

You can set a breakpoint or print statements inside the module to test if it's using the right path, like here: https://github.com/state-spaces/mamba/blob/009bec5ee37f586844a3fc89c040a9c1a9d8badf/mamba_ssm/modules/mamba_simple.py#L145

If it is, then I don't know what the problem is. It shouldn't be that much less efficient than an MLP. You're using the same batch size and sequence length for both models?
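
Note: besides a breakpoint, the fast path can be checked non-invasively. A sketch, assuming the use_fast_path flag and the optional causal_conv1d import used by mamba_simple.py in this version of the repo:

import torch
from mamba_ssm import Mamba

m = Mamba(d_model=128, d_state=16, d_conv=4, expand=2).to('cuda')

# The fused fast path is taken when this flag is set, the causal-conv1d
# kernel imported successfully, and no inference_params are passed.
print(getattr(m, 'use_fast_path', None))
try:
    from causal_conv1d import causal_conv1d_fn
    print('causal_conv1d available:', causal_conv1d_fn is not None)
except ImportError:
    print('causal_conv1d not installed -> slow path')

For what it's worth, the second traceback later in this thread goes through mamba_inner_fn (mamba_simple.py, line 146), which is the fast-path branch guarded by the linked line 145, so the fast path was indeed being taken.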

MachRepo commented 7 months ago

@albertfgu Hello again Mr. Gu, thank you very much for your assistance. I added a print statement inside the module, and it did print when I ran the model, so the fast path is being used. I am using a unified training setup where I only swap the model and launch training, so yes, the sequence length (15000) and batch size (12) are the same for both. I also tried Mamba on its own in the shell and got the same error message when choosing d_model greater than 100:

import torch
from mamba_ssm import Mamba

# d_model=100, batch 12: this size still runs
m = Mamba(d_model=100, d_state=16, d_conv=4, expand=2)
x = torch.rand(12, 15000, 100)
x = x.to('cuda')
m = m.to('cuda')
s = m(x)

# d_model=200, batch 10
m = Mamba(d_model=200, d_state=16, d_conv=4, expand=2)
x = torch.rand(10, 15000, 200)
x = x.to('cuda')
m = m.to('cuda')
s = m(x)

# d_model=200, batch 12: OOMs with the traceback below
m = Mamba(d_model=200, d_state=16, d_conv=4, expand=2)
x = torch.rand(12, 15000, 200)
x = x.to('cuda')
m = m.to('cuda')
s = m(x)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 137, in forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 550.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Process 42 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 3.13 GiB is allocated by PyTorch, and 43.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
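
Note: the 550.00 MiB figure is consistent with the in_proj matmul at mamba_simple.py line 137 materializing a (2 * d_inner, B * L) tensor, with d_inner = expand * d_model. A back-of-envelope check (assuming float32):

# Back-of-envelope size of the failing in_proj output for the last run above.
B, L, d_model, expand = 12, 15000, 200, 2
d_inner = expand * d_model
xz_bytes = (2 * d_inner) * (B * L) * 4  # (2*d_inner, B*L) in float32
print(xz_bytes / 2**20)                 # ~549.3 MiB, matching "550.00 MiB"

At sequence length 15000, every intermediate of this shape costs hundreds of MiB, so it is the activations, not the ~500k parameters, that exhaust a 4 GiB card.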

PS: I am already training another model on my GPU, which is why there is so little memory left, but I don't understand this part of the message:

Process 42 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use.
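
Note: that figure looks like a reporting artifact rather than a real measurement. 17179869184 GiB is exactly 2^34 GiB, i.e. 2^64 bytes:

# 17179869184 GiB is exactly 2**34 GiB = 2**64 bytes: an all-ones 64-bit
# counter, not a real per-process measurement. Per-process GPU accounting is
# likely unavailable here (the /mnt/c paths above suggest this runs under WSL).
print(17179869184 == 2**34)           # True
print(17179869184 * 2**30 == 2**64)   # True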

The weird thing is that when I execute these lines they work perfectly fine, but once I execute them again I get the same error message:

>>> import torch
>>> from mamba_ssm import Mamba
>>> m = Mamba(d_model=500,d_state=16,d_conv=4,expand=2)
>>> x = torch.rand(9, 15000, 500)
>>> x = x.to('cuda')
>>> m = m.to('cuda')
>>> s = m(x)
>>> import torch
>>> from mamba_ssm import Mamba
>>> m = Mamba(d_model=500,d_state=16,d_conv=4,expand=2)
>>> x = torch.rand(9, 15000, 500)
>>> x = x.to('cuda')
>>> m = m.to('cuda')
>>> s = m(x)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 306, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/bkffadia/.local/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/bkffadia/.local/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 181, in forward
    conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(x, conv1d_weight, conv1d_bias, None, True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 516.00 MiB. GPU 0 has a total capacity of 4.00 GiB of which 0 bytes is free. Process 42 has 17179869184.00 GiB memory in use. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 3.35 GiB is allocated by PyTorch, and 18.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
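
Note: the works-once-then-fails pattern is expected in a live interpreter. After the first run, x, s, and the autograd graph hanging off s (Mamba's parameters require grad, so the forward saves activations for backward) are still referenced, so the second run starts with the card already nearly full. A sketch of freeing memory between runs, using standard PyTorch calls:

import torch

# Drop references from the previous run; s also owns the saved-for-backward
# activations, which is where most of the memory lives.
del s, x
torch.cuda.empty_cache()                       # return cached blocks to the driver
print(torch.cuda.memory_allocated() / 2**20)   # MiB still held by live tensors

For pure forward-pass experiments, wrapping the call in torch.no_grad() avoids saving activations for backward in the first place.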