speedinghzl / pytorch-segmentation-toolbox

PyTorch Implementations for DeeplabV3 and PSPNet
MIT License

No such file or directory #7

Closed shivangbaveja closed 5 years ago

shivangbaveja commented 5 years ago

I was trying to run the model, without any modifications, on Google Cloud with a V100 GPU. I got the following error:

RuntimeError: CUDA Error encountered in <function CompiledLib.bn_mean_var_cuda at 0x7fd263866c80>
Traceback (most recent call last):
  File "evaluate.py", line 253, in <module>
    main()
  File "evaluate.py", line 198, in main
    saved_state_dict = torch.load(args.restore_from)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 356, in load
    f = open(f, 'rb')

Is it related to training on a V100 GPU? Currently I have only one GPU attached.

speedinghzl commented 5 years ago

@shivangbaveja I do not see any "No such file or directory" message in your traceback. Have you checked that the trained-model path is correct?
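
If it helps, here is a minimal sketch of such a check before loading. The path is only an example taken from the traceback below; map_location is standard torch.load behaviour on PyTorch 0.4+:

import os
import torch

restore_from = 'snapshots/CS_scenes_40000.pth'  # example; point this at your actual checkpoint

# Fail early with a clear message instead of the bare open() error inside torch.load
if not os.path.isfile(restore_from):
    raise FileNotFoundError('checkpoint not found: %s' % restore_from)

# map_location='cpu' also sidesteps device mismatches when loading on a different GPU
saved_state_dict = torch.load(restore_from, map_location='cpu')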

amiltonwong commented 5 years ago

Hi @speedinghzl,

I also encountered a similar issue:

321300 images are loaded!
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 251, in <module>
    main()
  File "train.py", line 215, in main
    preds = model(images)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 145, in forward
    x = self.layer3(x)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 42, in forward
    out = self.conv1(x)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Traceback (most recent call last):
  File "evaluate.py", line 253, in <module>
    main()
  File "evaluate.py", line 198, in main
    saved_state_dict = torch.load(args.restore_from)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/serialization.py", line 301, in load
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'snapshots/CS_scenes_40000.pth'
(tf1.3) root@milton-ThinkCentre-M93p:/data/code8/pytorch-segmentation-toolbox#

First, it gives me an out-of-memory error. How much GPU memory is required? My GPU is a TITAN Xp (12 GB).

Second, there is No such file or directory: 'snapshots/CS_scenes_40000.pth'. Is CS_scenes_40000.pth needed? I only downloaded resnet101-imagenet.pth into the dataset folder, as instructed.

Any suggestions for fixing these problems?

(P.S. My system is Ubuntu 16.04, PyTorch 0.4.0, CUDA 8.0, and cuDNN 7.)
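
For reference, here is a quick way to confirm what the process can see before training (standard torch.cuda calls):

import torch

# Report each visible device and its total memory; PSPNet with a large crop and
# batch can exhaust a single 12 GB card, so this confirms the available headroom.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print('GPU %d: %s, %.1f GB total' % (i, props.name, props.total_memory / 1024 ** 3))
print('currently allocated: %.1f MB' % (torch.cuda.memory_allocated() / 1024 ** 2))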

shivangbaveja commented 5 years ago

In my case, the error appeared when I was training on a V100 GPU. I tried running on a different GPU (a P50, I think) and that solved the issue. I am no longer working on this, so I am closing this issue.

speedinghzl commented 5 years ago

@amiltonwong The first error leads to the second one: training ran out of memory before saving any snapshot, so evaluate.py could not find 'snapshots/CS_scenes_40000.pth'. You need 4 x 12 GB GPUs to run this repo if you want to reproduce the high performance (~78.5% mIoU on the val set). Otherwise, just set a small batch size (2 images per GPU).
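
To illustrate the batch-size arithmetic, a sketch (not the repo's exact training code; the model and 769x769 crop size are placeholders):

import torch
import torch.nn as nn

# Stand-in for the PSPNet in networks/pspnet.py
model = nn.DataParallel(nn.Conv2d(3, 19, 3, padding=1)).cuda()

num_gpus = torch.cuda.device_count()
per_gpu_images = 2                       # 2 images per GPU, as suggested above
batch_size = per_gpu_images * num_gpus   # DataParallel splits dim 0 across GPUs

images = torch.randn(batch_size, 3, 769, 769).cuda()  # crop size is only an example
preds = model(images)                    # each GPU processes per_gpu_images images
print(preds.size())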