Open Carl-Lei opened 6 years ago
I'd suggest temporarily removing DataParallel and setting the batch size to 1 so that you can debug it manually. A size of 208 definitely fits on a 1080Ti card.
I added `del output` after line 96 of main.py and that solved the problem. However, when it reached the 65th file, another error occurred.
Traceback (most recent call last):
File "main.py", line 349, in
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x00000196E28DC3C8>>
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 349, in __del__
self._shutdown_workers()
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 328, in _shutdown_workers
self.worker_result_queue.get()
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 337, in get
return _ForkingPickler.loads(res)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 86, in rebuild_storage_filename
storage = cls._new_shared_filename(manager, handle, size)
RuntimeError: Couldn't open shared event:
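I have the same problem: an error occurs when running the test, reporting out of memory, even though training ran fine. I also tried reducing the batch_size for the test, but it still fails. Have you solved it? I would appreciate any advice, many thanks @Carl-Lei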
I ran into the same problem too.
You can add 'with torch.no_grad():' before these two lines:
    input = Variable(data[splitlist[i]:splitlist[i+1]], volatile = True).cuda()
    inputcoord = Variable(coord[splitlist[i]:splitlist[i+1]], volatile = True).cuda()
Under PyTorch 1.x+, you should use the following code:
def test(data_loader, net, get_pbb, save_dir, config):
    ...
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    ..
    for i_name, (data, target, coord, nzhw) in enumerate(data_loader):
        ...
        input = data[splitlist[i] : splitlist[i + 1]].to(device, non_blocking=True)
        inputcoord = coord[splitlist[i] : splitlist[i + 1]].to(device, non_blocking=True)
        with torch.no_grad():
            if isfeat:
                output, feature = net(input, inputcoord)
                featurelist.append(feature.data.cpu().numpy())
            else:
                output = net(input, inputcoord)
        ....
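For readers unfamiliar with the `splitlist[i] : splitlist[i + 1]` slicing used in these snippets, here is a minimal pure-Python sketch of how such chunk boundaries might be built. The helper name `make_splitlist` and the parameter `n_per_run` are hypothetical and not taken from this thread. The point is only that each loop iteration sends one slice of patches through the network, so any GPU memory still held from one iteration (for example autograd state, since the `volatile` flag has no effect from PyTorch 0.4 onward) adds to what the next iteration needs.

```python
def make_splitlist(n_items, n_per_run):
    """Return chunk boundaries such as [0, n_per_run, 2 * n_per_run, ..., n_items].

    Hypothetical helper for illustration; not taken from the thread.
    """
    bounds = list(range(0, n_items + 1, n_per_run))
    # Add a final boundary when n_items is not a multiple of n_per_run.
    if bounds[-1] != n_items:
        bounds.append(n_items)
    return bounds


# Example: 10 patches processed at most 4 at a time.
splitlist = make_splitlist(n_items=10, n_per_run=4)

# Each (start, stop) pair corresponds to one data[start:stop] slice,
# i.e. one forward pass in the test loop.
chunks = [(splitlist[i], splitlist[i + 1]) for i in range(len(splitlist) - 1)]

print(splitlist)
print(chunks)
```

With 10 patches and a chunk size of 4, the loop would run three forward passes over the slices (0, 4), (4, 8) and (8, 10). Lowering the chunk size reduces peak memory per pass, but it does not help if memory from earlier passes is never released.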
@lfz Hello, when I train the network, the training part of the detector model works, but an error occurs when running the test, reporting out of memory, even though training ran fine. I also tried reducing the batch_size for the test, but it still fails. In the test function:
    for i in range(len(splitlist)-1):
        input = Variable(data[splitlist[i]:splitlist[i+1]], volatile = True).cuda()
        inputcoord = Variable(coord[splitlist[i]:splitlist[i+1]], volatile = True).cuda()
        if isfeat:
            output,feature = net(input,inputcoord)
            featurelist.append(feature.data.cpu().numpy())
        else:
            output = net(input,inputcoord)
        outputlist.append(output.data.cpu().numpy())
It runs when i=0, but fails when i=1.
Error message:
Traceback (most recent call last):
File "main.py", line 353, in
main()
File "main.py", line 122, in main
test(test_loader, net, get_pbb, save_dir,config)
File "main.py", line 300, in test
output = net(input,inputcoord)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 114, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 124, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 65, in parallel_apply
raise output
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 41, in _worker
output = module(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "D:\mydsb\dsb_test\training\detector\res18.py", line 94, in forward
out = self.preBlock(x)#16
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\container.py", line 91, in forward
input = module(input)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 421, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at c:\programdata\miniconda3\conda-bld\pytorch_1524543037166\work\aten\src\thc\generic/THCStorage.cu:58
I am using two GTX 1080Ti cards. Below is the bash file:
    cd detector
    eps=100
    CUDA_VISIBLE_DEVICES=0,1 python main.py --model res18 -b 4 --epochs $eps --save-dir res18
    CUDA_VISIBLE_DEVICES=0,1 python main.py --model res18 -b 2 --resume results/res18/$eps.ckpt --test 1
    cp results/res18/$eps.ckpt ../../model/detector2.ckpt
I see that the image size during training is 128×128×128, while during testing it is 208×208×208. Could the error be related to this?
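To put the size question in numbers, here is a rough back-of-the-envelope sketch. It assumes cubic, single-channel float32 patches and that activation memory in a convolutional network grows roughly in proportion to the number of input voxels. Both are assumptions made for illustration, not measurements from this model, and the variable names are hypothetical.

```python
def voxels(side):
    """Number of voxels in a cubic patch with the given side length."""
    return side ** 3


TRAIN_SIDE = 128  # training patch side, from the question
TEST_SIDE = 208   # test patch side, from the question
BYTES_PER_FLOAT32 = 4

# How many times more voxels one test patch has than one training patch.
ratio = voxels(TEST_SIDE) / voxels(TRAIN_SIDE)

# Size of a single raw input patch in MiB (assumes one channel, float32).
train_mib = voxels(TRAIN_SIDE) * BYTES_PER_FLOAT32 / 2**20
test_mib = voxels(TEST_SIDE) * BYTES_PER_FLOAT32 / 2**20

print(f"voxel ratio: {ratio:.2f}x")
print(f"input patch: {train_mib:.1f} MiB (train) vs {test_mib:.1f} MiB (test)")
```

Under these assumptions, each test patch carries about 4.3 times as many voxels as a training patch, so a forward pass over it needs correspondingly more activation memory. The reply in this thread states that size 208 fits on a 1080Ti, so the patch size alone is probably not the whole explanation. It does mean that any memory kept between chunks, such as autograd state when inference is not wrapped in `torch.no_grad()`, is exhausted much faster at test time.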