hengyuan-hu / bottom-up-attention-vqa

An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.
GNU General Public License v3.0

getting memory error on Tesla K80 #23

Closed sujit420 closed 6 years ago

sujit420 commented 6 years ago

I am getting an error on loading:

```
Traceback (most recent call last):
  File "main.py", line 33, in <module>
    train_dset = VQAFeatureDataset('train', dictionary)
  File "/home/sujitmishra/bottom-up-attention-vqa/dataset.py", line 120, in __init__
    self.features = np.array(hf.get('image_features'))
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/sujitmishra/py2/local/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 690, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError
```

How much GPU memory does it need for training?

ZhuFengdaaa commented 6 years ago

The problem is that you need more memory: physical RAM plus swap space must add up to at least 50G. Correction: the total must add up to at least 80G.
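For scale: the line that fails in the first traceback, `np.array(hf.get('image_features'))`, materializes the entire HDF5 dataset in RAM at once. You can check how much that needs before attempting the load by reading only the metadata. A minimal sketch; the path `data/train36.hdf5` is an assumption (point it at wherever your preprocessing wrote the features), while the key `image_features` comes from the traceback:

```python
import h5py
import numpy as np

# Read shape/dtype from the HDF5 metadata only; nothing is loaded into RAM.
# The file path is an assumption -- adjust it to your own features file.
with h5py.File('data/train36.hdf5', 'r') as hf:
    ds = hf['image_features']
    n_bytes = int(np.prod(ds.shape)) * ds.dtype.itemsize
    print('shape=%s dtype=%s' % (ds.shape, ds.dtype))
    print('a full load needs %.1f GiB of RAM' % (n_bytes / float(1024 ** 3)))
```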

sujit420 commented 6 years ago

Thanks for your response @ZhuFengdaaa. I increased my swap space to a total of more than 50G. Now I am getting a different issue:

```
Traceback (most recent call last):
  File "main.py", line 45, in <module>
    train(model, train_loader, eval_loader, args.epochs, args.output)
  File "/home/sujitmishra/bottom-up-attention-vqa/train.py", line 36, in train
    for i, (v, b, q, a) in enumerate(train_loader):
  File "/home/sujitmishra/py2/local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 417, in __iter__
    return DataLoaderIter(self)
  File "/home/sujitmishra/py2/local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 234, in __init__
    w.start()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
```

How much memory does it need to train? Or is there a pretrained model available for evaluation/inference?
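A note on this second traceback: it fails inside `os.fork()` when the DataLoader starts its worker processes, and each fork must be able to reserve the parent's already huge address space, which is why adding swap helps. If you cannot add swap, a common generic workaround (not something this repo does by default) is to disable workers so that nothing is forked. A sketch, assuming `train_dset` is the `VQAFeatureDataset` built in main.py and that `batch_size=512` matches the repo's default:

```python
from torch.utils.data import DataLoader

# train_dset: the VQAFeatureDataset('train', dictionary) from main.py.
# num_workers=0 builds batches in the main process, so os.fork() is never
# called and [Errno 12] cannot be raised from the loader. The trade-off
# is slower data loading. batch_size=512 is an assumption.
train_loader = DataLoader(train_dset, batch_size=512,
                          shuffle=True, num_workers=0)
```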

ZhuFengdaaa commented 6 years ago

I used htop to check the training process: its virtual memory usage is 77.2G, so aim for at least that much and try again. I encountered exactly the same bug, OSError: [Errno 12] Cannot allocate memory, and it went away once there was enough virtual memory.

I have 40G of physical memory, so I created a 50G swap. You might need more.

sujit420 commented 6 years ago

Thanks a lot @ZhuFengdaaa. Training is pretty slow, but it's running. I will ask you if I encounter further errors.

DaddyWesker commented 6 years ago

So, if I got it right, I need to create more virtual memory? Can you tell me how to do that? As I understand it, I need more than 50 GB of virtual memory?
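On Linux, swap space is usually added with fallocate (or dd), mkswap, and swapon; see your distribution's documentation for the exact steps. To check from Python whether RAM plus swap clears the ~80G figure quoted above, here is a small sketch using the third-party psutil package (an assumption: psutil is not a dependency of this repo):

```python
import psutil  # pip install psutil; not part of this repo's requirements

# Total physical RAM plus total swap, in GiB.
ram = psutil.virtual_memory().total
swap = psutil.swap_memory().total
total_gib = (ram + swap) / float(1024 ** 3)
print('RAM + swap: %.1f GiB' % total_gib)
# The thread above suggests this should be at least ~80 GiB.
```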

YuanEZhou commented 5 years ago

I encountered the same problem and fixed it by modifying dataset.py as in the attached file: dataset.py.txt
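The attachment is not reproduced here, but the usual fix of this kind is to keep the h5py dataset handle and index it per item, instead of materializing the whole array with `np.array` as dataset.py does. A minimal sketch of that idea only; the class structure and file path are assumptions, not the actual contents of dataset.py.txt:

```python
import h5py
import torch

class LazyFeatures(object):
    """Sketch: read one item's features from disk per access, so the
    full feature array is never resident in RAM."""

    def __init__(self, h5_path):
        # Keep a handle; h5py does not load the dataset until indexed.
        self.hf = h5py.File(h5_path, 'r')
        self.features = self.hf['image_features']  # stays on disk

    def __getitem__(self, idx):
        # h5py reads only the requested slice from disk.
        return torch.from_numpy(self.features[idx])

    def __len__(self):
        return self.features.shape[0]
```

One caveat with this approach: an open h5py handle does not always survive being forked into DataLoader workers, so you may want num_workers=0 or to open the file lazily inside each worker.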