taconite / IntrinsicAvatar

[CVPR 2024] IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing
https://neuralbodies.github.io/IntrinsicAvatar/
MIT License

CUDA out of memory #6

Open anhnb206110 opened 3 weeks ago

anhnb206110 commented 3 weeks ago

Thank you for your work on this project!

I followed your instructions to train the model on the 'male-3-casual' sequence from the PeopleSnapshot dataset, without modifying anything in the config file. However, I encountered a CUDA out-of-memory error during training.

Here’s the error message I received:

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.28 GiB (GPU 0; 39.39 GiB total capacity; 28.72 GiB already allocated; 1.14 GiB free; 34.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It appears that the memory usage increases with each epoch until it runs out of memory. Could you please help me understand why this is happening and suggest any possible solutions to resolve the issue?

trThanhnguyen commented 2 weeks ago

Hi @anhnb206110 , the increase in memory consumption happens once the physically based rendering (material) stage of training kicks in. You can find its settings in the configs/config.yaml file. Personally, I'm handling it by reducing the number of samples per pixel; note that if you do so, you'll also need to change some related code in models/intrinsic_avatar.py, lines 1392-1407. I'm not sure this is the right approach, though, so I hope you can share your solution as well while we wait for the authors' recommendation.
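For illustration only: if the config is standard OmegaConf/Hydra-style YAML (suggested by configs/config.yaml and the dotted override mentioned later in this thread), a lower-SPP variant could be written out as below. The key name model.samples_per_pixel is a placeholder, not necessarily the repo's real name, so inspect the model section of the actual config first.

```python
from omegaconf import OmegaConf

# Load the training config, lower the (placeholder) SPP key, and save a
# low-memory variant of the config.
cfg = OmegaConf.load("configs/config.yaml")
print(OmegaConf.to_yaml(cfg.model))            # find which key controls samples per pixel
cfg.model.samples_per_pixel = 4                # placeholder key name; fewer samples, lower peak VRAM
OmegaConf.save(cfg, "configs/config_lowmem.yaml")
```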

taconite commented 1 week ago

Hi,

I tried a clean installation and tested on the male-3-casual subject. Unfortunately, I did not run into any OOM issue. My current GPU (TITAN RTX) has only 24 GB of VRAM, and the VRAM usage peaks at ~21 GB.

Could you specify during which epochs you observe the memory growth? And does the OOM happen during training or during validation (validation runs every 2000 steps, interleaved with training)?
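If it helps to pin this down, here is a minimal monitoring sketch (assuming pytorch-lightning 1.9.x hook names; this is not code from the repo) that prints the CUDA peak after every training epoch and after every validation pass:

```python
import torch
from pytorch_lightning.callbacks import Callback

class PeakMemoryLogger(Callback):
    """Print and reset the peak CUDA allocation at phase boundaries."""

    def _report(self, tag: str) -> None:
        peak_gib = torch.cuda.max_memory_allocated() / 2**30
        print(f"[{tag}] peak CUDA memory: {peak_gib:.2f} GiB")
        torch.cuda.reset_peak_memory_stats()  # start fresh for the next phase

    def on_train_epoch_end(self, trainer, pl_module) -> None:
        self._report("train epoch")

    def on_validation_end(self, trainer, pl_module) -> None:
        self._report("validation")
```

It can be passed to the Trainer via callbacks=[PeakMemoryLogger()]; how callbacks are wired into this repo's launch script may differ.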

On a separate note, my test environments are Ubuntu 20.04 and CentOS 7.9.2009 with Python 3.10, PyTorch 1.13, and CUDA 11.6. It might be helpful to align the software versions, especially the Python and PyTorch versions. For this project my pytorch-lightning version is 1.9.5, so it would also be good to know which pytorch-lightning version you are using.
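A quick way to dump the relevant versions for comparison (plain Python, nothing repo-specific):

```python
import sys
import torch
import pytorch_lightning as pl

# Compare these against Python 3.10, PyTorch 1.13, CUDA 11.6, pytorch-lightning 1.9.5.
print("Python           :", sys.version.split()[0])
print("PyTorch          :", torch.__version__)
print("CUDA (torch)     :", torch.version.cuda)
print("pytorch-lightning:", pl.__version__)
```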

Lastly, I have observed GPU memory leaks with pytorch-lightning in other projects. They mainly happen during inference (under torch.no_grad()), where certain tensors that should be freed after inference are somehow kept in memory. You could try disabling the validation routine during training by adding trainer.val_check_interval=null and see whether the OOM issue persists.
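As a generic, unofficial mitigation (it does not fix the underlying leak), you can also force garbage collection and release PyTorch's cached CUDA blocks right after each validation pass, e.g. with another small callback:

```python
import gc

import torch
from pytorch_lightning.callbacks import Callback

class FreeCudaCacheAfterValidation(Callback):
    """Release cached CUDA memory once validation finishes."""

    def on_validation_end(self, trainer, pl_module) -> None:
        gc.collect()              # drop unreachable Python references to tensors
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
```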

taconite commented 1 week ago

> Hi @anhnb206110 , the increase in memory consumption happens once the physically based rendering (material) stage of training kicks in. You can find its settings in the configs/config.yaml file. Personally, I'm handling it by reducing the number of samples per pixel; note that if you do so, you'll also need to change some related code in models/intrinsic_avatar.py, lines 1392-1407. I'm not sure this is the right approach, though, so I hope you can share your solution as well while we wait for the authors' recommendation.

I am not sure you are encountering the same issue as @anhnb206110: it seems that your GPU simply has too little memory to run at the default SPP, and a TITAN RTX (24 GB) or better is needed for the default config. If you want to keep the full SPP while reducing VRAM usage, you can also try lowering model.secondary_shader_chunk to e.g. 80000, trading training/inference speed for memory consumption.
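To illustrate the trade-off with a generic sketch (this is not the repo's actual shading code): secondary rays are processed in fixed-size chunks, so only one chunk's intermediate tensors are resident on the GPU at a time; a smaller chunk lowers peak VRAM at the cost of more kernel launches, i.e. slower training/inference.

```python
import torch

def shade_in_chunks(ray_feats: torch.Tensor, shader, chunk: int = 80000) -> torch.Tensor:
    """Apply `shader` to secondary rays in chunks of `chunk` rays to bound peak memory."""
    outputs = []
    for i in range(0, ray_feats.shape[0], chunk):
        outputs.append(shader(ray_feats[i:i + chunk]))
    return torch.cat(outputs, dim=0)
```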