learnables / learn2learn

A PyTorch Library for Meta-learning Research
http://learn2learn.net
MIT License

Potential Memory Leak Error #284

Closed. pandeydeep9 closed this issue 2 years ago.

pandeydeep9 commented 2 years ago

I installed learn2learn using "pip install learn2learn". When I try to run maml_miniimagenet.py (from learn2learn/examples/vision/maml_miniimagenet.py) with a batch size of 2 and shots = 1, I get the error below after 63 iterations. When I change to shots = 5, I get the error after 3 iterations.

Iteration 63
Meta Train Error 2.0417345762252808
Meta Train Accuracy 0.20000000298023224
Meta Valid Error 1.8002310991287231
Meta Valid Accuracy 0.20000000298023224
Traceback (most recent call last):
  File "/home/deep/Desktop/IMPLEMENTATION/MyTry/MetaSGD/mini_Temp_Test.py", line 156, in <module>
    main()
  File "/home/deep/Desktop/IMPLEMENTATION/MyTry/MetaSGD/mini_Temp_Test.py", line 106, in main
    evaluation_error.backward()
  File "/home/deep/.local/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/deep/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.79 GiB total capacity; 3.60 GiB already allocated; 77.56 MiB free; 3.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When I look at nvidia-smi, the memory usage gradually increases with each iteration. However, if I comment out the meta-validation loss part (lines 112-114 in this script), I don't get the memory leak problem. I think the issue is similar to Potential Memory Leak #278. I wonder why this issue occurs and how it can be solved?
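
For reference, the growth can also be confirmed from inside the training loop with plain PyTorch calls; this is a minimal sketch, assuming `iteration` is the loop counter of the example script:

# Rough per-iteration logging of CUDA memory usage, in MiB
allocated = torch.cuda.memory_allocated() / 1024 ** 2
reserved = torch.cuda.memory_reserved() / 1024 ** 2
print(f"Iteration {iteration}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")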

Phoveran commented 2 years ago

Actually, I was occupied by another project, so I had not solved this before closing my issue. Maybe the problem does exist.

seba-1511 commented 2 years ago

Thanks for raising the issue @pandeydeep9 and @Phoveran,

These leaks are worrisome. Could you share more about your setup? Which GPU, CPU, and versions of Python, PyTorch, and learn2learn? It seems to be hardware-dependent, since @nightlessbaron wasn't able to reproduce the bug on Colab. Also, are you running the mini-imagenet script as-is?

pandeydeep9 commented 2 years ago

I reduced the meta_batch_size parameter to 2 and shots to 1; that is the only change I made to the example mini-imagenet script. My CPU is an Intel(R) Core(TM) i7-9750H @ 2.60GHz. My GPU is a TU106M [GeForce RTX 2060 Mobile], version a1, clock 33MHz. I am on Python 3.8.5, learn2learn 0.1.5, and torch 1.10.0+cu102.

tranquangchung commented 2 years ago

I have the same issue, even when I run "maml_miniimagenet.py" on an A5000 with 24GB. After a few iterations, it gives the "CUDA out of memory" error. I tried many versions (0.1.3, 0.1.4, 0.1.5, 0.1.6), so I am worried that if I use this library for my project, I will face a lot of trouble in the future. Can you give me some advice on how to fix it? Many thanks.

seba-1511 commented 2 years ago

Thanks for the additional feedback. Are you also using PyTorch v.1.10? And does commenting out the validation step also fix the memory leak?

tranquangchung commented 2 years ago

Yes, I use PyTorch 1.10.0, CUDA 11.2, and Python 3.8.12.

Phoveran commented 2 years ago

My setting:

Python: 3.9.7
learn2learn: 0.1.6 (using pip install learn2learn)
PyTorch: 1.10
CUDA: 11.4
GPU: RTX 2080 Ti

seba-1511 commented 2 years ago

Thanks for the extra info, I wonder if the issue cropped up with PyTorch 1.10 on CUDA 11+. As a temporary fix, does changing learner = maml.clone() to learner = maml.clone(first_order=True) on l. 112 solve the leak for you?
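
For reference, the change would look roughly like this in the meta-validation block of the example (paraphrased, not a verbatim copy of the script):

# Compute meta-validation loss, cloning without the second-order graph
learner = maml.clone(first_order=True)  # was: learner = maml.clone()
batch = tasksets.validation.sample()
evaluation_error, evaluation_accuracy = fast_adapt(
    batch, learner, loss, adaptation_steps, shots, ways, device,
)
meta_valid_error += evaluation_error.item()
meta_valid_accuracy += evaluation_accuracy.item()

Since the validation loss is never back-propagated through the meta-parameters, the second-order graph built by the default clone() should not be needed there.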

pandeydeep9 commented 2 years ago

Yes, adding first_order=True on l. 112 solves the leak problem. I also expect this to give the intended results, since we can use first-order MAML during the validation/test phases and get the same outcome (i.e., we do not need to track gradients for MAML during test/validation).

Thanks

ligeng0197 commented 2 years ago

We hit the same case; our setting is PyTorch 1.10.0, Python 3.9.5, CUDA 11.5, Tesla M40 (24GB). We are glad to see this issue posted, since we had been debugging our code repeatedly with no idea what was causing the growing CUDA memory occupation over validation/test iterations.

tobiasvanderwerff commented 2 years ago

I was facing the same issue, but managed to solve it by downgrading PyTorch from version 1.10 to 1.9. I was using the following setup:

learn2learn: 0.1.6
Python: 3.8.6
PyTorch: 1.10
GPU: Nvidia V100 (32GB)
CUDA: 10.2

With this setup, memory usage kept increasing over epochs until an out-of-memory error occurred. However, with PyTorch 1.9, memory usage stabilizes.

sjtugzx commented 2 years ago

> Thanks for the extra info, I wonder if the issue cropped up with PyTorch 1.10 on CUDA 11+. As a temporary fix, does changing learner = maml.clone() to learner = maml.clone(first_order=True) on l. 112 solve the leak for you?

Honestly, I tried to use MAML for fine-tuning a T5 transformer. Before adding "first_order=True", I could only run 2 tps, and this change did not fully fix my problem. After adding the parameter, I could run 4 tps, but I still got the memory leak. I guess there are still some problems, exposed by huge networks such as transformers.

learn2learn: 0.1.6
Python: 3.9
PyTorch: 1.10
GPU: 3080 (24GB)
CUDA: 10.2

seba-1511 commented 2 years ago

The memory leak seems to have been introduced in PyTorch 1.10. @sjtugzx do you also see leaks with T5 on PyTorch 1.9?

I haven't had time to investigate it yet, so help is welcome.

kzhang2 commented 2 years ago

I have a suggestion for a potential fix, although it is a bit hacky. From my observations, the key problem leading to the memory leak seems to be that the compute graph for the gradient update is created even when first_order=True. During training, I think the memory doesn't accumulate because the compute graph gets flushed when you call loss.backward(). At evaluation time, however, you never call loss.backward(), so the memory usage can keep growing.

In my code, what I've done to get rid of this extra, unneeded memory usage at evaluation time is to add an eval flag to the adapt function inside MAML and MetaSGD, which causes the gradient update to be wrapped in a no_grad context, so

# Update the module
self.module = maml_update(self.module, self.lr, gradients)

becomes

# Update the module
if eval:
    with torch.no_grad():
        self.module = maml_update(self.module, self.lr, gradients)
    for p in self.module.parameters():
        p.requires_grad = True
else:
    self.module = maml_update(self.module, self.lr, gradients)

I haven't investigated this in detail, so I'm not sure this is the best way to proceed, but let me know if it seems promising, whether I should investigate further, and maybe even open a pull request.
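
At the call site, the proposed flag would be used roughly as below. This is a hypothetical sketch: eval is only the flag suggested above (not part of the released adapt() signature), and x_support/y_support/x_query/y_query stand in for a task's support and query tensors.

# Meta-test adaptation that applies the inner updates under no_grad (proposed)
learner = maml.clone()
for step in range(adaptation_steps):
    adaptation_error = loss(learner(x_support), y_support)
    learner.adapt(adaptation_error, eval=True)  # proposed flag, not in the current API
evaluation_error = loss(learner(x_query), y_query)  # no backward() needed at test time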

seba-1511 commented 2 years ago

For people following along: @kzhang2 and I have been discussing on Slack and we came up with a fix. Expect a PR + release in the next 2 weeks. Meanwhile, the fix is to update the update_module function in learn2learn/utils/__init__.py as follows:

def update_module(module, updates=None, memo=None):
    r"""
    [[Source]](https://github.com/learnables/learn2learn/blob/master/learn2learn/utils.py)

    **Description**

    Updates the parameters of a module in-place, in a way that preserves differentiability.

    The parameters of the module are swapped with their update values, according to:
    \[
    p \gets p + u,
    \]
    where \(p\) is the parameter, and \(u\) is its corresponding update.

    **Arguments**

    * **module** (Module) - The module to update.
    * **updates** (list, *optional*, default=None) - A list of gradients for each parameter
        of the model. If None, will use the tensors in .update attributes.

    **Example**
    ~~~python
    error = loss(model(X), y)
    grads = torch.autograd.grad(
        error,
        model.parameters(),
        create_graph=True,
    )
    updates = [-lr * g for g in grads]
    l2l.update_module(model, updates=updates)
    ~~~
    """
    if memo is None:
        memo = {}
    if updates is not None:
        params = list(module.parameters())
        if not len(updates) == len(list(params)):
            msg = 'WARNING:update_module(): Parameters and updates have different length. ('
            msg += str(len(params)) + ' vs ' + str(len(updates)) + ')'
            print(msg)
        for p, g in zip(params, updates):
            p.update = g

    # Update the params
    for param_key in module._parameters:
        p = module._parameters[param_key]
        if p is not None and hasattr(p, 'update') and p.update is not None:
            if p in memo:
                module._parameters[param_key] = memo[p]
            else:
                updated = p + p.update
                p.update = None
                memo[p] = updated
                module._parameters[param_key] = updated

    # Second, handle the buffers if necessary
    for buffer_key in module._buffers:
        buff = module._buffers[buffer_key]
        if buff is not None and hasattr(buff, 'update') and buff.update is not None:
            if buff in memo:
                module._buffers[buffer_key] = memo[buff]
            else:
                updated = buff + buff.update
                buff.update = None
                memo[buff] = updated
                module._buffers[buffer_key] = updated

    # Then, recurse for each submodule
    for module_key in module._modules:
        module._modules[module_key] = update_module(
            module._modules[module_key],
            updates=None,
            memo=memo,
        )

    # Finally, rebuild the flattened parameters for RNNs
    # See this issue for more details:
    # https://github.com/learnables/learn2learn/issues/139
    if hasattr(module, 'flatten_parameters'):
        module._apply(lambda x: x)
    return module

seba-1511 commented 2 years ago

Quick update: this is fixed, tested, and available in the new v0.1.7 release.

aritroCoder commented 1 month ago

Hi, I am using learn2learn and getting a memory leak error. This is the code I am using:

#Load model weights
model.load_state_dict(torch.load('mnist_model_weights_450.pth', map_location={'cuda:2' : 'cuda:0'}))

# run the test data
meta_test_loss = 0.0
for idx, (context_x, context_y, target_x, target_y) in enumerate(test_loader):
    context_x, context_y, target_x, target_y = context_x.to(device), context_y.to(device), target_x.to(device), target_y.to(device)
    effective_batch_size = context_x.size(0)
    for i in range(effective_batch_size):
        learner = maml.clone(first_order=True)
        x_support, y_support = context_x[i], context_y[i]
        x_query, y_query = target_x[i], target_y[i]
        y_support = y_support.view(-1)
        y_query = y_query.view(-1)
        for _ in range(num_epochs):
            wts, predictions = learner(x_support)
            loss = custom_loss_function(predictions, y_support, wts)
            learner.adapt(loss)
        wts, predictions = learner(x_query)
        loss = custom_loss_function(predictions, y_query, wts)
        meta_test_loss += loss
    meta_test_loss /= effective_batch_size
    if idx % 10 == 0:
        print(f"Iteration: {idx+1}, Meta test loss: {meta_test_loss}")

print(f"Final Meta test loss: {meta_test_loss}")

I am getting this error:


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 3.06 MiB is free. Process 54265 has 14.74 GiB memory in use. Of the allocated memory 14.51 GiB is allocated by PyTorch, and 102.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  

learn2learn: 0.2.0
Python: 3.9
PyTorch: 2.4.0+cu121 (using Google Colab)
GPU: T4 (15GB)
CUDA: 12.2

Can anyone tell me how to fix it? I wrote the training loop similarly, but that one runs fine.