torralba-lab / im2recipe-Pytorch

im2recipe Pytorch implementation
MIT License

Memory requirements #8

Closed akhil2495 closed 5 years ago

akhil2495 commented 6 years ago

Can I know the memory requirements for running this? I keep running into an unexpected bus error even with 8 GB RAM plus a 24 GB swap file, and the swap file does not even get fully occupied when the error occurs. I am using the CPU version of torch to run this.

```
=> loading checkpoint 'model_e220_v-4.700.pth.tar'
=> loaded checkpoint 'model_e220_v-4.700.pth.tar' (epoch 220)
/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torchvision-0.2.1-py2.7.egg/torchvision/transforms/transforms.py:188: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
Test loader prepared. 321
i 0 321
test.py:110: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  v = torch.autograd.Variable(input[j], volatile=True)
test.py:116: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  v = torch.autograd.Variable(target[j], volatile=True)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
[... the ERROR line above repeats once per data-loading worker ...]
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f97abe148d0>> ignored
Traceback (most recent call last):
  File "test.py", line 199, in <module>
    main()
  File "test.py", line 90, in main
    test(test_loader, model, criterion)
  File "test.py", line 127, in test
    output = model(input_var[0], input_var[1], input_var[2], input_var[3], input_var[4])
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/Documents/im2recipe/im2recipe-Pytorch/trijoint.py", line 134, in forward
    visual_emb = self.visionMLP(x)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 109, in forward
    return self.module(*inputs, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torchvision-0.2.1-py2.7.egg/torchvision/models/resnet.py", line 76, in forward
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
  File "/home/akhil/anaconda2/envs/im2recipe/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 12487) is killed by signal: Bus error.
```

nhynes commented 6 years ago

Hi @makhilbabu, thanks for bringing this up. I can't say for sure what the exact memory requirements are but 32 GB should be around what you need. You can probably get away with reducing the number of workers, though. Swap won't be very useful when you're trying to share memory between processes.
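
A quick way to check whether the data loader is what's hitting shared memory is to run with fewer (or zero) workers. As a minimal sketch only, the `TensorDataset` below is a stand-in for the dataset `test.py` actually builds, not this repo's code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real test dataset, just to illustrate the loader settings.
dataset = TensorDataset(torch.randn(64, 3, 224, 224),
                        torch.zeros(64, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=False,
    num_workers=0,   # 0 = load in the main process; nothing goes through /dev/shm
)

for images, labels in loader:
    pass  # feed each batch to the model here
```

With `num_workers > 0`, each worker is a separate process and batches are handed back through shared memory, which is why a small `/dev/shm` triggers the bus error.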

And, of course, you should take PyTorch's advice and use `with torch.no_grad():` (or use PyTorch v0.3.*).
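
For the `volatile` warning, the pattern looks roughly like this (a self-contained sketch; the `Linear` model and random tensor are stand-ins, not the trijoint model from this repo):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)      # stand-in for the actual model
batch = torch.randn(4, 10)    # stand-in for one test batch

model.eval()
with torch.no_grad():         # replaces Variable(input[j], volatile=True)
    output = model(batch)     # no autograd graph is kept, so eval memory stays low
```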

akhil2495 commented 6 years ago

Thank you. One more thing: do you know how many cores the command `python test.py --model_path=snapshots/model_e220_v-4.700.pth.tar` uses by default when run on CUDA? Does it depend on the number of workers set?

nhynes commented 6 years ago

> does it depend on number of workers set?

Yes, this is the only variable that affects the number of CPUs used when running on CUDA.
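
Roughly speaking (a hedged sketch with a stand-in dataset, not this repo's code): each worker is a separate CPU process that prepares batches, while the forward pass itself runs on the GPU, so the main process adds relatively little CPU load on top of the workers.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.set_num_threads(1)  # optionally cap intra-op CPU threads in the main process

dataset = TensorDataset(torch.randn(32, 3, 224, 224))            # stand-in data
loader = DataLoader(dataset, batch_size=8, num_workers=4)        # ~4 extra CPU processes

for (batch,) in loader:
    if torch.cuda.is_available():
        batch = batch.cuda()  # compute happens on the GPU from here on
```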