Out of memory error when training best model on imagenet

tremblerz commented 6 years ago

I am using V100 gpu which has 16G memory. Here is the error log-

07/10 07:05:24 PM valid 000 2.609589e+00 47.656250 76.562500
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train_imagenet.py", line 230, in <module>
    main() 
  File "train_imagenet.py", line 152, in main
    valid_acc_top1, valid_acc_top5, valid_obj = infer(valid_queue, model, criterion)
  File "train_imagenet.py", line 214, in infer
    logits, _ = model(input)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/darts/cnn/model.py", line 207, in forward
    s0, s1 = s1, cell(s0, s1, self.drop_path_prob)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/darts/cnn/model.py", line 51, in forward
    h1 = op1(h1)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/darts/cnn/operations.py", line 66, in forward
    return self.op(x)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/.torch-env/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

quark0 commented 6 years ago

Hard to tell without more details such as the pytorch version. If you use pytorch 0.4, be sure to wrap the validation scripts into torch.no_grad() as otherwise you would get OOM. I would also try smaller batch sizes and check the memory consumption during training & validation.

tremblerz commented 6 years ago

Thank you, that solves the issue.

dragen1860 commented 5 years ago

Thanks. https://github.com/dragen1860/DARTS-PyTorch Here is the darts version supporting pytorch 1.0.

quark0 / darts

Out of memory error when training best model on imagenet #14