maum-ai / faceshifter

Unofficial PyTorch Implementation for FaceShifter (https://arxiv.org/abs/1912.13457)
BSD 3-Clause "New" or "Revised" License

CUDA out of memory. #9

Closed vmr013 closed 3 years ago

vmr013 commented 3 years ago

Hi, can someone tell me what I can do to fix this issue? This is what I get when running it on my local machine or on Google Colab.

2020-11-05 11:56:07.097272: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: WORLD_SIZE environment variable (2) is not equal to the computed world size (1). Ignored.
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=ddp
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)

  | Name     | Type                        | Params
---------------------------------------------------------
0 | G        | ADDGenerator                | 372 M 
1 | E        | MultilevelAttributesEncoder | 67 M  
2 | D        | MultiscaleDiscriminator     | 8 M   
3 | Z        | ResNet                      | 43 M  
4 | Loss_GAN | GANLoss                     | 0     
5 | Loss_E_G | AEI_Loss                    | 0     
Validation sanity check: 0it [00:00, ?it/s]/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2494: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  "See the documentation of nn.Upsample for details.".format(mode))
Epoch 0:   0% 0/5000250 [00:00<?, ?it/s] Traceback (most recent call last):
  File "aei_trainer.py", line 62, in <module>
    main(args)
  File "aei_trainer.py", line 40, in main
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 844, in run_training_batch
    self.hiddens
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 1015, in optimizer_closure
    hiddens)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 1197, in training_forward
    output = self.model(*args)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/overrides/data_parallel.py", line 170, in forward
    output = self.module.training_step(*inputs[0], **kwargs[0])
  File "/content/faceshifter/aei_net.py", line 54, in training_step
    output, z_id, output_z_id, feature_map, output_feature_map = self(target_img, source_img)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/aei_net.py", line 44, in forward
    output = self.G(z_id, feature_map)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/model/AEINet.py", line 132, in forward
    x = self.model["layer_7"](x, z_att[7], z_id)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/model/AEINet.py", line 98, in forward
    x1 = self.activation(self.add1(h_in, z_att, z_id))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/faceshifter/model/AEINet.py", line 72, in forward
    h_out = (1-m)*a + m*i
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.73 GiB total capacity; 12.56 GiB already allocated; 45.88 MiB free; 1.17 GiB cached)
Epoch 0:   0%|          | 0/5000250 [00:36<?, ?it/s]
m-pektas commented 3 years ago

Your GPU memory is not enough to train this model. First, you can try decreasing your batch size.
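
For illustration, here is a minimal sketch (not repo code; the tiny model and the 256x256 input are placeholders) of how peak GPU memory can be measured as the batch size shrinks, using standard torch.cuda memory APIs (torch.cuda.reset_peak_memory_stats needs a reasonably recent PyTorch):

import torch

# Hypothetical stand-in model; the real training uses the AEI-Net generator/encoder.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 3, 3, padding=1),
).cuda()

for batch_size in (16, 8, 4, 2, 1):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 3, 256, 256, device="cuda")
    model(x).sum().backward()  # include the backward pass, where training memory peaks
    peak_mib = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"batch_size={batch_size}: peak allocation {peak_mib:.0f} MiB")

If the peak barely moves when the batch size drops, the bulk of the memory is taken by the model weights and activations of a fixed-size input rather than by the batch dimension.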

vmr013 commented 3 years ago

I tried batch_size values of 8, 4, 2, and 1. I get the same memory error for each.

This is the model configuration from train.yaml:

model:
  learning_rate_E_G: 4e-4
  learning_rate_D: 4e-4

  beta1: 0
  beta2: 0.999

  batch_size: 16

  num_workers: 16
  grad_clip: 0.0

Do you see the issue? I don't.
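
For context, batch_size and num_workers from this config presumably end up in a standard PyTorch DataLoader (an assumption; the exact wiring lives in the repo's data module). Only batch_size affects GPU memory per step; num_workers just spawns CPU loader processes:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the repo's face dataset.
dataset = TensorDataset(torch.randn(100, 3, 256, 256))

# batch_size determines how much data is moved to the GPU per step;
# num_workers costs host RAM and CPU, not GPU memory.
loader = DataLoader(dataset, batch_size=4, num_workers=2, shuffle=True)
batch, = next(iter(loader))
print(batch.shape)  # torch.Size([4, 3, 256, 256])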

m-pektas commented 3 years ago

I have gotten this error repeatedly before, and the reason is always that my GPU memory is not enough. If you are on Linux, you can watch your GPU memory while training starts by running "watch nvidia-smi" in a terminal.
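
As a complement to watch nvidia-smi, the same numbers can be read from inside the training process with standard torch.cuda calls; a minimal sketch (not part of this repo) that can be called before and after the forward/backward passes:

import torch

def log_gpu_memory(tag=""):
    # Current and peak allocations by this process, plus total device capacity, in MiB.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    total = torch.cuda.get_device_properties(0).total_memory / 1024 ** 2
    print(f"[{tag}] allocated={allocated:.0f} MiB  peak={peak:.0f} MiB  total={total:.0f} MiB")

log_gpu_memory("before forward")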

vmr013 commented 3 years ago

Thanks