nv-tlabs / GSCNN

Gated-Shape CNN for Semantic Segmentation (ICCV 2019)
https://nv-tlabs.github.io/GSCNN/

Unable to train due to CUDA out of memory error in Google Colab with ultra low resolution pics #26

Closed sainatarajan closed 4 years ago

sainatarajan commented 4 years ago

Hi, thanks for the repo. However, I am not able to run training on the Cityscapes dataset. I have around 50 images for training and about 10 each for validation and testing. I have reduced the image resolution to 128x128 and it still gives a CUDA out of memory error. I am running this on Google Colab, which has 12 GB of GPU memory. Can you tell me what I should do to be able to run this model? Are there any settings that need to be tweaked? @shubhaminnani @tovacinni @varunjampani @davidjesusacu @ShreyasSkandanS

tovacinni commented 4 years ago

Something you may want to try is deleting variables as soon as they are no longer needed, to free up memory earlier. Unfortunately we didn't try running this on low-memory GPUs, so the memory usage is likely sub-optimal. In a future version (when I have more time), I can try to optimize memory usage further.
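For what it's worth, a minimal sketch of that idea; the function and variable names here are hypothetical, not this repo's train.py:

```python
import torch

# Sketch: drop references to large tensors as soon as they are no longer needed
# so the caching allocator can reuse that memory earlier in the step.
def train_step(net, criterion, optimizer, inputs, targets):
    outputs = net(inputs)
    loss = criterion(outputs, targets)
    del outputs                   # drop our Python reference to the activations early

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    loss_value = loss.item()
    del loss                      # drop the last reference to the loss tensor
    torch.cuda.empty_cache()      # release unused cached blocks back to the driver
    return loss_value
```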

sainatarajan commented 4 years ago

Thank you for your reply. The model looks very complicated to play around with, and it would be difficult to revert any changes I make, so if you could help me do this it would be of great help. However, I have only 50 images for training, and I think Google Colab should be able to handle this small load with 12 GB of GPU memory. I even tried a 64x64 resolution and it still failed.

shubhaminnani commented 4 years ago

Try reducing the crop size. The number of images has nothing to do with the model's memory usage; fewer images only means less time to train. Also try a 16 GB GPU if you can, it works on a 16 GB GPU.
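As a rough illustration of why the crop size (not the dataset size) dominates memory, activation memory grows with the square of the crop. The batch size and channel count below are made up, not GSCNN's actual feature maps:

```python
import torch

# Hypothetical feature map: batch of 2, 256 channels, at 1/4 of the crop resolution.
for crop in (720, 360, 128):
    feat = torch.zeros(2, 256, crop // 4, crop // 4)
    mib = feat.element_size() * feat.nelement() / 1024 ** 2
    print(f"crop {crop}: ~{mib:.0f} MiB for one such activation")
```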

sainatarajan commented 4 years ago

@shubhaminnani Thank you. After setting the crop size to 360, training went further but stopped with the error below. Can you tell me why? Here is the stack trace:

Traceback (most recent call last):
  File "train.py", line 383, in <module>
    main()
  File "train.py", line 154, in main
    train(train_loader, net, criterion, optim, epoch, writer)
  File "train.py", line 233, in train
    main_loss = net(inputs, gts=mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/gscnn/network/gscnn.py", line 327, in forward
    return self.criterion((seg_out, edge_out), gts)              
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/My Drive/gscnn/loss.py", line 161, in forward
    return self.nll_loss(F.log_softmax(inputs), targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1314, in log_softmax
    dim = _get_softmax_dim('log_softmax', input.dim(), _stacklevel)
AttributeError: 'tuple' object has no attribute 'dim'
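This last error is unrelated to memory: F.log_softmax is being applied to the tuple (seg_out, edge_out) returned by the network rather than to a tensor, which suggests the selected criterion expects a single segmentation output instead of the network's two-output tuple. A minimal sketch of the mismatch and of the kind of unpacking a matching loss would do; the class count, shapes, and helper below are hypothetical, not the repo's loss.py:

```python
import torch
import torch.nn.functional as F

seg_out = torch.randn(2, 19, 64, 64)             # hypothetical per-class logits
edge_out = torch.randn(2, 1, 64, 64)             # hypothetical edge map
targets = torch.randint(0, 19, (2, 64, 64))      # hypothetical labels

# Reproduces the traceback: log_softmax receives the tuple itself, not a tensor.
# F.log_softmax((seg_out, edge_out), dim=1)  # AttributeError: 'tuple' object has no attribute 'dim'

# A criterion that receives the network's (seg_out, edge_out) tuple must unpack it
# before applying tensor operations:
def seg_only_loss(outputs, targets):
    seg, _edge = outputs                         # unpack; ignore the edge branch here
    return F.nll_loss(F.log_softmax(seg, dim=1), targets)

print(seg_only_loss((seg_out, edge_out), targets))
```

If that is what is happening here, using the loss that matches the network's two-output forward pass (or unpacking the tuple before a segmentation-only criterion) should avoid the crash.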