Hi @makhilbabu, thanks for bringing this up. I can't say for sure what the exact memory requirements are, but 32 GB should be around what you need. You can probably get away with reducing the number of workers, though. Swap won't be very useful when you're trying to share memory between processes.

And, of course, you should take PyTorch's advice and use `with torch.no_grad():` (or use PyTorch v0.3.*).
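For reference, the change that warning asks for looks roughly like this. This is a minimal sketch assuming a simple evaluation loop; the variable names and loop structure are placeholders, not the actual code in test.py:

```python
import torch

def test(test_loader, model, criterion):
    # Minimal sketch of the suggested change, not the project's real test():
    # variable names and the loop structure here are placeholders.
    model.eval()
    # no_grad() replaces the removed Variable(..., volatile=True) pattern:
    # no autograd graph is built, so intermediate activations are freed
    # right away and evaluation uses far less memory.
    with torch.no_grad():
        for inputs, target in test_loader:
            output = model(*inputs)            # forward pass only
            loss = criterion(output, target)
            print(loss.item())
```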
Thank you. One more thing: do you know how many cores the command `python test.py --model_path=snapshots/model_e220_v-4.700.pth.tar` uses by default when run on CUDA? Does it depend on the number of workers set?
> Does it depend on the number of workers set?

Yes, the number of workers is the only variable that affects how many CPUs are used when running on CUDA.
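Concretely, the worker count passed to the DataLoader determines how many CPU processes are spawned to prepare batches, so lowering it reduces both CPU and shared-memory usage. A minimal sketch with a made-up dataset and batch size (not the project's actual loader setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset purely for illustration; the project builds its own test set.
dataset = TensorDataset(torch.randn(100, 3, 224, 224),
                        torch.zeros(100, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=25,    # arbitrary value for the sketch
    shuffle=False,
    num_workers=2,    # fewer worker processes -> fewer CPUs and less shared memory
)

# num_workers=0 would assemble batches in the main process and avoid the
# workers' shared-memory (shm) usage entirely, at the cost of loading speed.
for images, labels in loader:
    pass  # each batch here was prepared by one of the 2 worker processes
```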
Can I know the memory requirements for running this? I keep running into an unexpected bus error even with 8 GB RAM + a 24 GB swap file, and the swap file does not even get fully occupied when the error occurs. I am using the CPU version of torch to run this.
=> loading checkpoint 'model_e220_v-4.700.pth.tar'
=> loaded checkpoint 'model_e220_v-4.700.pth.tar' (epoch 220)
/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torchvision-0.2.1-py2.7.egg/torchvision/transforms/transforms.py:188: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
Test loader prepared. 321
i 0 321
test.py:110: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  v = torch.autograd.Variable(input[j], volatile=True)
test.py:116: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  v = torch.autograd.Variable(target[j], volatile=True)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
[the line above appears 15 times]
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f97abe148d0>> ignored
Traceback (most recent call last):
  File "test.py", line 199, in <module>
    main()
  File "test.py", line 90, in main
    test(test_loader, model, criterion)
  File "test.py", line 127, in test
    output = model(input_var[0], input_var[1], input_var[2], input_var[3], input_var[4])
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/Documents/im2recipe/im2recipe-Pytorch/trijoint.py", line 134, in forward
    visual_emb = self.visionMLP(x)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 109, in forward
    return self.module(*inputs, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torchvision-0.2.1-py2.7.egg/torchvision/models/resnet.py", line 76, in forward
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 12487) is killed by signal: Bus error.