sicara / easy-few-shot-learning

Ready-to-use code and tutorial notebooks to boost your way into few-shot learning for image classification.
MIT License

CUDA out of memory #130

Open durach opened 11 months ago

durach commented 11 months ago

Description

The issue could be related to #116. I am adapting Prototypical Networks to my case, and I noticed that you're using an adjusted version of ResNet in your examples. Based on my experiments, this version is not a direct replacement for the standard torch ResNet implementation: at the very least, it consumes more GPU memory under the same circumstances.

How To Reproduce

import torch
import torch.nn
import torch.optim
import easyfsl.modules

DEVICE = torch.device('cuda')

model = easyfsl.modules.resnet18(use_fc=True, num_classes=100).to(DEVICE)

LOSS_FUNCTION = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

SIZE = 6

model.train()
for i in range(5):
    print(f"Epoch {i}")

    optimizer.zero_grad()

    images = torch.rand(SIZE, 3, 224, 224)
    labels = torch.randint(0, 100, (SIZE,))

    print(images.shape)
    print(labels.shape)

    model_output = model(images.to(DEVICE))
    loss = LOSS_FUNCTION(model_output, labels.to(DEVICE))
    loss.backward()

    optimizer.step()

Output:

Epoch 0
torch.Size([6, 3, 224, 224])
torch.Size([6])
{
    "name": "OutOfMemoryError",
    "message": "CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 7.79 GiB of which 6.31 MiB is free. Process 290479 has 6.22 GiB memory in use. Including non-PyTorch memory, this process has 1.55 GiB memory in use. Of the allocated memory 1.36 GiB is allocated by PyTorch, and 52.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF",
    "stack": "---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[1], line 29
     27 model_output = model(images.to(DEVICE))
     28 loss = LOSS_FUNCTION(model_output, labels.to(DEVICE))
---> 29 loss.backward()
     31 optimizer.step()

File ~/.pyenv/versions/3.11.6/envs/satellite-dataset/lib/python3.11/site-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File ~/.pyenv/versions/3.11.6/envs/satellite-dataset/lib/python3.11/site-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 7.79 GiB of which 6.31 MiB is free. Process 290479 has 6.22 GiB memory in use. Including non-PyTorch memory, this process has 1.55 GiB memory in use. Of the allocated memory 1.36 GiB is allocated by PyTorch, and 52.65 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"
}

The same code with SIZE = 5, or with the "native" torchvision ResNet module, doesn't cause the issue:

import torch
import torch.nn
import torch.optim
import torchvision.models

DEVICE = torch.device('cuda')

model = torchvision.models.resnet18(num_classes=100).to(DEVICE)

LOSS_FUNCTION = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

SIZE = 6

model.train()
for i in range(5):
    print(f"Epoch {i}")

    optimizer.zero_grad()

    images = torch.rand(SIZE, 3, 224, 224)
    labels = torch.randint(0, 100, (SIZE,))

    print(images.shape)
    print(labels.shape)

    model_output = model(images.to(DEVICE))
    loss = LOSS_FUNCTION(model_output, labels.to(DEVICE))
    loss.backward()

    optimizer.step()

Additional context

The code above was executed on:

GPU 0: NVIDIA GeForce RTX 2070 SUPER, Memory: 7.79 GB
GPU 1: NVIDIA GeForce RTX 2070 SUPER, Memory: 7.79 GB
Driver Version: 520.61.05, CUDA Version: 11.8
Torch version: 2.1.1+cu118
Torchvision version: 0.16.1+cu118

I also tried Google Colab: on a T4 (16 GB), I reached a batch size of 64 with the EasyFSL version of ResNet18 and 512 with the torch version.
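
For completeness, here is a minimal sketch of how one could compare peak GPU memory of the two backbones in a single run (the peak_memory_mb helper is mine, not part of easyfsl, and the batch size is kept small so that both models fit on this card):

import torch
import torch.nn
import torchvision.models
import easyfsl.modules

DEVICE = torch.device('cuda')
LOSS_FUNCTION = torch.nn.CrossEntropyLoss()

def peak_memory_mb(model, batch_size=4):
    # One forward/backward pass on random data, then report peak allocated GPU memory in MiB.
    model = model.to(DEVICE).train()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(DEVICE)

    images = torch.rand(batch_size, 3, 224, 224, device=DEVICE)
    labels = torch.randint(0, 100, (batch_size,), device=DEVICE)

    LOSS_FUNCTION(model(images), labels).backward()

    return torch.cuda.max_memory_allocated(DEVICE) / 1024 ** 2

print("easyfsl resnet18:", peak_memory_mb(easyfsl.modules.resnet18(use_fc=True, num_classes=100)))
print("torchvision resnet18:", peak_memory_mb(torchvision.models.resnet18(num_classes=100)))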

I am very much a beginner in ML and may be misusing the framework.

ebennequin commented 11 months ago

The custom ResNets in easyfsl come from this implementation, which is a now not-so-recent fork of PyTorch's ResNet, and it is quite possible that its memory usage is suboptimal.

A quick response to this problem would be to make this clear in our custom ResNet's docstring.

A better response would be to start a deep study of the differences between this implementation and PyTorch's, and find how to improve our memory usage.

The best response would probably be to reimplement our custom ResNet as an extension of PyTorch's, and reduce the differences between the two to a minimum.

Note that the last two options could cause unexpected shifts between the results obtained with easyfsl and those of other works that use FiveAI's implementation, but it's probably worth it: easyfsl is meant to promote best practices in the field.
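
To give an idea of that last option, here is a rough sketch (the class name is hypothetical and the only deviation kept here is a use_fc flag; an actual refactoring would have to preserve everything easyfsl currently expects from its backbones):

import torch
from torchvision.models import ResNet
from torchvision.models.resnet import BasicBlock

class CustomResNet18(ResNet):
    # Hypothetical: extend torchvision's ResNet and keep only the extra behaviour
    # easyfsl needs from its backbones, here an optional use_fc flag.
    def __init__(self, use_fc: bool = False, **kwargs):
        super().__init__(BasicBlock, [2, 2, 2, 2], **kwargs)
        self.use_fc = use_fc

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same stem and stages as torchvision's ResNet._forward_impl.
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        features = torch.flatten(self.avgpool(x), 1)
        # Only deviation: return flat features unless the classification head is requested.
        return self.fc(features) if self.use_fc else features

# Would be a drop-in for the snippet above, with torchvision's memory behaviour.
model = CustomResNet18(use_fc=True, num_classes=100)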

durach commented 11 months ago

Thank you. I'll look into both architectures and come up with improvements if I find any. For now, I've switched to the native PyTorch implementation: it uses less memory and is much faster, at least in my case.
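
In case it helps others, this is roughly what the switch looks like on my side (a sketch only; it assumes PrototypicalNetworks accepts any backbone that maps images to feature vectors):

import torch
from torch import nn
from torchvision.models import resnet18
from easyfsl.methods import PrototypicalNetworks

DEVICE = torch.device('cuda')

# Standard torchvision ResNet18, with the classification head replaced so the
# backbone outputs 512-dimensional feature vectors instead of logits.
backbone = resnet18()
backbone.fc = nn.Identity()

few_shot_model = PrototypicalNetworks(backbone).to(DEVICE)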

Should I close the issue?

ebennequin commented 11 months ago

No, it's definitely an active issue and deserves to be addressed.