Closed: shivangbaveja closed this issue 5 years ago
@shivangbaveja I do not see any "No such file or directory" message in your output. Have you checked that the trained model path is correct?
Hi @speedinghzl,
I also encountered a similar issue:
321300 images are loaded!
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 251, in <module>
    main()
  File "train.py", line 215, in main
    preds = model(images)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 145, in forward
    x = self.layer3(x)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code8/pytorch-segmentation-toolbox/networks/pspnet.py", line 42, in forward
    out = self.conv1(x)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Traceback (most recent call last):
  File "evaluate.py", line 253, in <module>
    main()
  File "evaluate.py", line 198, in main
    saved_state_dict = torch.load(args.restore_from)
  File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/serialization.py", line 301, in load
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'snapshots/CS_scenes_40000.pth'
(tf1.3) root@milton-ThinkCentre-M93p:/data/code8/pytorch-segmentation-toolbox#
First, it reports an out-of-memory error. How much GPU memory is required? My GPU is a TITAN Xp (12 GB).
Second, No such file or directory: 'snapshots/CS_scenes_40000.pth'. Is CS_scenes_40000.pth needed? I only downloaded resnet101-imagenet.pth into the dataset folder, as instructed.
Any suggestions for fixing these problems?
(P.S. My system is Ubuntu 16.04, PyTorch 0.4.0, CUDA 8.0 and cuDNN 7.)
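As for the second error: torch.load ultimately calls open(path, 'rb'), so a missing snapshot file raises FileNotFoundError before any deserialization happens. A minimal sketch (the helper name is hypothetical, not part of the repo) that guards the load and prints a clearer message:

```python
import os

def load_checkpoint_safely(path):
    # torch.load opens the file with open(path, 'rb'), so a missing
    # snapshot raises FileNotFoundError before anything is deserialized.
    if not os.path.exists(path):
        print("Checkpoint not found: %s -- train first, or point "
              "--restore-from at an existing .pth file" % path)
        return None
    import torch  # deferred import; only needed once the file exists
    return torch.load(path, map_location='cpu')

state = load_checkpoint_safely('snapshots/CS_scenes_40000.pth')
```

Here 'snapshots/CS_scenes_40000.pth' only exists after training has run to iteration 40000, which is why evaluation fails when training dies with the OOM error first.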
In my case, the error came when I was training on a V100 GPU. I tried running on a different GPU (I think a P50) and that solved the issue. I am no longer working on this, so I am closing this issue.
@amiltonwong The first error leads to the second one. You need 4 x 12 GB GPUs to run this repo if you want to reproduce the high performance (~78.5% mIoU on the val set). Otherwise, just set a smaller batch size (2 images per GPU).
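As a quick sanity check on that advice (a sketch, not code from the repo): torch.nn.DataParallel scatters the input batch evenly across visible GPUs, so per-card memory is driven by the per-GPU share of the global batch, not the global batch itself:

```python
def per_gpu_batch(global_batch, n_gpus):
    # DataParallel splits the batch across devices, so each card sees
    # roughly global_batch / n_gpus images per forward pass.
    if global_batch % n_gpus:
        raise ValueError("choose a batch size divisible by the GPU count")
    return global_batch // n_gpus

# Recommended setup: 4 x 12 GB cards with 2 images each -> global batch 8.
print(per_gpu_batch(8, 4))  # -> 2
# The same global batch on a single 12 GB card quadruples its load:
print(per_gpu_batch(8, 1))  # -> 8
```

This is why an unmodified run on one TITAN Xp hits the OOM in layer3: the whole batch lands on a single 12 GB card instead of being spread over four.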
I was trying to run the model without any modifications on Google Cloud with a V100 GPU. I got the following error:
RuntimeError: CUDA Error encountered in <function CompiledLib.bn_mean_var_cuda at 0x7fd263866c80>
Traceback (most recent call last):
  File "evaluate.py", line 253, in <module>
    main()
  File "evaluate.py", line 198, in main
    saved_state_dict = torch.load(args.restore_from)
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 356, in load
    f = open(f, 'rb')
Is it related to training on a V100 GPU? Currently I have only one GPU attached.