marcoamonteiro / pi-GAN


CUDA OOM error during training #4

Closed: athenas-lab closed this issue 3 years ago

athenas-lab commented 3 years ago

Hi,

I tried to train the code on the CARLA dataset, but I am getting a CUDA out-of-memory error. These are the things I have tried so far:

1) Running on a single as well as multiple 2080 Ti GPUs (specified using CUDA_VISIBLE_DEVICES), each with 11 GB of memory, but it still produces the OOM error.
2) Running on a 3090 GPU, but the code generates errors on the 3090 that are not related to the CUDA OOM error.
3) Reducing the batch size for the CARLA dataset in curriculum.py from 30 to 10, as shown below. I still get the OOM error when I run on a single or on multiple 2080 Ti GPUs.

    CARLA = {
        0: {'batch_size': 10, 'num_steps': 48, 'img_size': 32, 'batch_split': 1, 'gen_lr': 4e-5, 'disc_lr': 4e-4},
        int(10e3): {'batch_size': 14, 'num_steps': 48, 'img_size': 64, 'batch_split': 2, 'gen_lr': 2e-5, 'disc_lr': 2e-4},
        int(55e3): {'batch_size': 10, 'num_steps': 48, 'img_size': 128, 'batch_split': 5, 'gen_lr': 10e-6, 'disc_lr': 10e-5},
        int(200e3): {},

Is there anything else I can do to fix the OOM error?

thanks

marcoamonteiro commented 3 years ago

Hi,

How many 2080 GPUs did you try running on concurrently? We trained our models with 48GB of GPU memory.

Could you try increasing the batch_split on the first CARLA step to 4? That'll divide the batch into multiple runs and reduce memory usage.
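For context, batch_split is essentially gradient accumulation: each optimizer step processes the batch in batch_split smaller chunks, so peak activation memory scales with batch_size / batch_split rather than batch_size. A self-contained toy sketch of the idea (a stand-in model and placeholder loss, not the actual train.py code):

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 3)                        # toy stand-in for the generator
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5)

    batch_size, batch_split = 28, 4                  # 4 chunks of 7 samples each
    split_size = batch_size // batch_split
    z = torch.randn(batch_size, 128)

    optimizer.zero_grad()
    for i in range(batch_split):
        subset_z = z[i * split_size:(i + 1) * split_size]   # only this chunk's activations are live
        loss = model(subset_z).square().mean()               # placeholder loss
        (loss / batch_split).backward()                       # gradients accumulate across chunks
    optimizer.step()                                          # one update for the full batch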

athenas-lab commented 3 years ago

Hi,

1) I tried running on 2 to 6 2080 Ti GPUs (so 22 GB to 66 GB in total) with batch_size values ranging from 6 to 30 and batch_split values of 1 and 4. In each case I got a CUDA OOM error. The issue appears to be in siren.py, as noted below.

    Progress to next stage:   0%|          | 0/10000 [00:16<?, ?it/s]
    Traceback (most recent call last):
      File "train.py", line 400, in <module>
        mp.spawn(train, args=(num_gpus, opt), nprocs=num_gpus, join=True)

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "pi_gan/train.py", line 263, in train
        gen_imgs, gen_positions = generator_ddp(subset_z, **metadata)
      File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "pi_gan/generators/generators.py", line 49, in forward
        coarse_output = self.siren(transformed_points, z, ray_directions=transformed_ray_directions_expanded).reshape(batch_size, img_size * img_size, num_steps, 4)
      File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "pi_gan/siren/siren.py", line 133, in forward
        return self.forward_with_frequencies_phase_shifts(input, frequencies, phase_shifts, ray_directions, **kwargs)
      File "pi_gan/siren/siren.py", line 143, in forward_with_frequencies_phase_shifts
        x = layer(x, frequencies[..., start:end], phase_shifts[..., start:end])
      File "python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "pi_gan/siren/siren.py", line 94, in forward
        return torch.sin(freq * x + phase_shift)
    RuntimeError: CUDA out of memory. Tried to allocate 720.00 MiB (GPU 0; 10.76 GiB total capacity; 7.26 GiB already allocated; 441.44 MiB free; 8.13 GiB reserved in total by PyTorch)

I have only made changes to the first line below; for the remaining 3 steps I am retaining the original values. Should they be changed based on the parameters in the first line?

    0: {'batch_size': 6, 'num_steps': 48, 'img_size': 32, 'batch_split': 1, 'gen_lr': 4e-5, 'disc_lr': 4e-4},
    int(10e3): {'batch_size': 14, 'num_steps': 48, 'img_size': 64, 'batch_split': 2, 'gen_lr': 2e-5, 'disc_lr': 2e-4},
    int(55e3): {'batch_size': 10, 'num_steps': 48, 'img_size': 128, 'batch_split': 5, 'gen_lr': 10e-6, 'disc_lr': 10e-5},
    int(200e3): {},
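For reference, judging from the reshape in generators.py in the traceback above, the SIREN is evaluated on roughly (batch_size / batch_split) * img_size^2 * num_steps points per forward pass, and activation memory grows with that count. A rough back-of-the-envelope (not an exact memory model):

    # Rough count of SIREN query points per forward pass (per batch split);
    # a back-of-the-envelope estimate, not an exact memory model.
    def points_per_split(batch_size, batch_split, img_size, num_steps):
        return (batch_size // batch_split) * img_size * img_size * num_steps

    print(points_per_split(6, 1, 32, 48))     # stage 0 above:       294,912 points
    print(points_per_split(14, 2, 64, 48))    # stage int(10e3):   1,376,256 points
    print(points_per_split(10, 5, 128, 48))   # stage int(55e3):   1,572,864 points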

zzw-zwzhang commented 3 years ago

Using 11798 MiB:

    0: {'batch_size': 28 * 2, 'num_steps': 12, 'img_size': 64, 'batch_split': 8, 'gen_lr': 6e-5, 'disc_lr': 2e-4},
    int(200e3): {},
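For clarity, this just replaces the stage schedule at the top of the CARLA dict in curriculum.py; with batch_size 28 * 2 = 56 and batch_split 8, only 7 images at 64x64 with 12 steps go through the SIREN per forward pass. Something like the following (keeping whatever other CARLA entries your curriculum.py already defines unchanged):

    # curriculum.py -- stage schedule that fit in ~12 GB; the rest of the
    # CARLA settings stay exactly as they were in the original file.
    CARLA = {
        0: {'batch_size': 28 * 2, 'num_steps': 12, 'img_size': 64,
            'batch_split': 8, 'gen_lr': 6e-5, 'disc_lr': 2e-4},
        int(200e3): {},
        # ... remaining CARLA settings unchanged ...
    }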

marcoamonteiro commented 3 years ago

Thanks for the reply @zwzhang121. Sounds like that curriculum worked for you?

@athena913 we've noticed that when you split training across multiple GPUs, one of the GPUs needs a little more memory than if you were training on just one GPU. Given that the above curriculum worked for @zwzhang121, I'd recommend increasing the batch_split and seeing if that works.
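If you want to confirm which rank is the heavy one, printing the allocator stats from inside train() is a quick check (standard torch.cuda calls, nothing pi-GAN specific; where exactly you call it is up to you):

    import torch

    def log_gpu_memory(rank):
        # Standard PyTorch allocator stats for the GPU this rank is using.
        allocated = torch.cuda.memory_allocated(rank) / 1024 ** 3
        peak = torch.cuda.max_memory_allocated(rank) / 1024 ** 3
        print(f"rank {rank}: {allocated:.2f} GiB allocated, {peak:.2f} GiB peak")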