microsoft / ReCo

ReCo: Region-Controlled Text-to-Image Generation, CVPR 2023
MIT License

CUDA out of memory when running train.sh #8

Open rfuruta opened 1 year ago

rfuruta commented 1 year ago

Hi, when I try to run train.sh on two GPUs (TITAN RTX, 24GB VRAM each), I get the following errors:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 23.65 GiB total capacity; 21.76 GiB already allocated; 24.75 MiB free; 21.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

RuntimeError: CUDA out of memory. Tried to allocate 148.00 MiB (GPU 0; 23.65 GiB total capacity; 21.18 GiB already allocated; 144.31 MiB free; 21.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried to reduce the GPU memory usage by changing some hyperparameters in v1-finetune_cocogit.yaml, such as data.params.batch_size -> 1, data.params.num_workers -> 1, and model.params.first_stage_config.params.ddconfig.resolution -> 64, but the errors above still occur. Is there any way to further reduce GPU memory usage so that train.sh can run? Also, how much VRAM is required to run it with the default hyperparameters?

zyang-ur commented 1 year ago

The default setting requires ~32GB of VRAM. Optimizing the SD model might help reduce the required VRAM further (e.g., converting the model to fp16, using DeepSpeed, or searching existing optimized Stable Diffusion repos); a rough sketch of the fp16 idea is below.
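To make the fp16 suggestion concrete, here is a rough sketch, not code from this repo: it assumes the latent-diffusion style loading that main.py uses (an OmegaConf config plus ldm.util.instantiate_from_config) and simply casts the loaded model to half precision; it also sets the allocator option that the OOM message itself recommends. Treat the helper names, the split size, and the strict=False flag as assumptions to adapt.

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

    import torch
    from omegaconf import OmegaConf
    from ldm.util import instantiate_from_config  # helper from the SD/latent-diffusion code base

    def load_model_fp16(config_path, ckpt_path):
        """Load the finetuning model and cast it to fp16, roughly halving weight VRAM."""
        config = OmegaConf.load(config_path)
        model = instantiate_from_config(config.model)
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state.get("state_dict", state), strict=False)
        return model.half().cuda()

Note that naively running everything in fp16 can produce NaNs; mixed precision (torch.cuda.amp) or DeepSpeed's fp16 support is usually more stable.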

rfuruta commented 1 year ago

Thank you very much!

rfuruta commented 1 year ago

I tried to run train.sh on eight A100 GPUs (each with 40GB of VRAM), but the following errors still occurred.

Traceback (most recent call last):
  File "/ws/main.py", line 606, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "/ws/main.py", line 55, in load_model_from_config
    model.cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
Traceback (most recent call last):
  File "/ws/main.py", line 606, in <module>
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 5 more times]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    model = load_model_from_config(config, opt.actual_resume)
  File "/ws/main.py", line 55, in load_model_from_config
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 39.45 GiB total capacity; 7.11 GiB already allocated; 7.31 MiB free; 7.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ws/main.py", line 826, in <module>
    model.cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
Traceback (most recent call last):
  File "/ws/main.py", line 606, in <module>
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 624, in _apply
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
    self._buffers[key] = fn(buf)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 39.45 GiB total capacity; 5.99 GiB already allocated; 7.31 MiB free; 6.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ws/main.py", line 826, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
    model = load_model_from_config(config, opt.actual_resume)
  File "/ws/main.py", line 55, in load_model_from_config
    model.cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ws/main.py", line 826, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
Traceback (most recent call last):
  File "/ws/main.py", line 606, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "/ws/main.py", line 55, in load_model_from_config
    model.cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 8 more times]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 39.45 GiB total capacity; 3.03 GiB already allocated; 7.31 MiB free; 3.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ws/main.py", line 826, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
Traceback (most recent call last):
  File "/ws/main.py", line 606, in <module>
    model = load_model_from_config(config, opt.actual_resume)
  File "/ws/main.py", line 55, in load_model_from_config
    model.cuda()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 138, in cuda
    return super().cuda(device=device)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ws/main.py", line 826, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined

zyang-ur commented 1 year ago

Please kindly update the batch size to 1 here: https://github.com/microsoft/ReCo/blob/ac9e7e8f638ead0d079bf952dd6275bdf98304e2/configs/reco/v1-finetune_cocogit.yaml#L80C1-L81C1 And, optionally, update the gradient accumulation (e.g., times 8) if you wish to keep the same effective batch size: https://github.com/microsoft/ReCo/blob/ac9e7e8f638ead0d079bf952dd6275bdf98304e2/configs/reco/v1-finetune_cocogit.yaml#L144C5-L144C28
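For reference, a small sketch of what those two edits amount to, written with OmegaConf (the config loader used by the Stable Diffusion code base ReCo builds on). The key paths data.params.batch_size and lightning.trainer.accumulate_grad_batches are assumptions based on the linked lines, so double-check them against the yaml:

    from omegaconf import OmegaConf

    cfg = OmegaConf.load("configs/reco/v1-finetune_cocogit.yaml")
    cfg.data.params.batch_size = 1                      # per-GPU batch size
    cfg.lightning.trainer.accumulate_grad_batches = 8   # optional: keep the same effective batch size
    OmegaConf.save(cfg, "configs/reco/v1-finetune_cocogit_bs1.yaml")  # hypothetical output path

Editing the yaml by hand has the same effect; the snippet only spells out which keys change.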

rfuruta commented 1 year ago

Although I changed the batch size to 1 and ran train.sh on the A100 GPUs, I got the same errors.

zyang-ur commented 1 year ago

Sorry to hear that. Do you mind trying a single GPU, to rule out the possibility that the GPU local rank is not set properly and multiple processes all try to use GPU:0? Defining a very small UNet in the config file's unet_config (without loading weights) might be another way to debug. Meanwhile, I'll try to recall other possible causes. Please feel free to post any observations or guesses for discussion. Thanks.
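One way to check the local-rank hypothesis is to print which device each process actually ends up on. A minimal diagnostic sketch, assuming train.sh launches the job with torch.distributed/Lightning DDP so that LOCAL_RANK and RANK are set in each process's environment:

    import os
    import torch

    # Run inside each training process (e.g., near the top of main.py) before the model is moved to GPU.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)  # pin this process to its own GPU
    print(f"rank {os.environ.get('RANK', '?')}: local_rank={local_rank}, "
          f"current device=cuda:{torch.cuda.current_device()} "
          f"({torch.cuda.get_device_name(local_rank)})")

If several ranks report cuda:0, that would explain GPU 0 running out of memory while the other GPUs stay underused.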

rfuruta commented 1 year ago

Thank you so much for your support! I found that train.sh runs successfully with four or fewer A100 GPUs, although I don't know why.

zyang-ur commented 1 year ago

Could you check, when running with two or four GPUs, what the memory usage looks like in nvidia-smi?

rfuruta commented 1 year ago

This is the result of nvidia-smi when running with four GPUs on a machine that has five A100 GPUs.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   37C    P0   246W / 400W |  31014MiB / 40960MiB |     54%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   38C    P0   264W / 400W |  28475MiB / 40960MiB |     66%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:0D:00.0 Off |                    0 |
| N/A   36C    P0   271W / 400W |  28465MiB / 40960MiB |     60%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   35C    P0   256W / 400W |  28475MiB / 40960MiB |     46%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:1F:00.0 Off |                    0 |
| N/A   28C    P0    47W / 400W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10370      C   python                          28487MiB |
|    0   N/A  N/A     10952      C   /opt/conda/bin/python             841MiB |
|    0   N/A  N/A     11024      C   /opt/conda/bin/python             841MiB |
|    0   N/A  N/A     11207      C   /opt/conda/bin/python             841MiB |
|    1   N/A  N/A     10952      C   /opt/conda/bin/python           28473MiB |
|    2   N/A  N/A     11024      C   /opt/conda/bin/python           28463MiB |
|    3   N/A  N/A     11207      C   /opt/conda/bin/python           28473MiB |
+-----------------------------------------------------------------------------+

And this is the one when running with two GPUs on the same machine.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:03:00.0 Off |                    0 |
| N/A   38C    P0   254W / 400W |  29330MiB / 40960MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:05:00.0 Off |                    0 |
| N/A   37C    P0   247W / 400W |  28461MiB / 40960MiB |     80%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:0D:00.0 Off |                    0 |
| N/A   24C    P0    42W / 400W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   24C    P0    44W / 400W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:1F:00.0 Off |                    0 |
| N/A   26C    P0    49W / 400W |      2MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16608      C   python                          28487MiB |
|    0   N/A  N/A     17207      C   /opt/conda/bin/python             841MiB |
|    1   N/A  N/A     17207      C   /opt/conda/bin/python           28459MiB |
+-----------------------------------------------------------------------------+

Judging from these numbers, there should be enough VRAM to run with five GPUs, yet the CUDA out-of-memory errors appear as soon as I use five. The same errors also occur when trying to run with five or more GPUs on a machine that has eight A100 GPUs. I don't think this is a hardware problem, since I'm using a cloud-computing service.

GiftNovice commented 4 hours ago

Hello, I'm facing the same issue. Could you please share half-precision (fp16) training code? I'm encountering NaN values when I try it myself. Thank you very much!
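For what it's worth, NaNs with half precision are commonly avoided by using automatic mixed precision (fp16 compute with fp32 master weights and gradient scaling) rather than casting the whole model to fp16. A minimal, repo-independent sketch of that pattern; the model, optimizer, and dataloader here are placeholders, not ReCo code:

    import torch

    def train_step_amp(model, batch, optimizer, scaler):
        """One mixed-precision step: fp16 forward/backward with gradient scaling
        to avoid the underflow that often shows up as NaN losses."""
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(batch)  # placeholder: assumes the model returns a scalar loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        return loss.detach()

    # Usage sketch:
    # scaler = torch.cuda.amp.GradScaler()
    # for batch in dataloader:
    #     loss = train_step_amp(model, batch, optimizer, scaler)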