jeong-tae / RACNN-pytorch

This is a third party implementation of RA-CNN in pytorch.

There is some error: I changed logits,_, _= net(images) into logits, cc, aa= net(images) because it used to raise an error, but now I get the same error #3

Closed deepblue0822 closed 6 years ago

deepblue0822 commented 6 years ago

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "trainer2.py", line 305, in <module>
    train()
  File "trainer2.py", line 94, in train
    logits, cc, aa= net(images)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/Desktop/ww/RACNN-pytorch-master/models/RACNN.py", line 55, in forward
    conv5_4_A = self.b2.features:-1
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1194, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f10b8d49dd8>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/usr/local/lib/python3.5/dist-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

jeong-tae commented 6 years ago

Can you reduce the batch size further?

deepblue0822 commented 6 years ago

OK, I will try. Thank you.

jeong-tae commented 6 years ago

I can't figure out how to reproduce that error.

Normally, a StopIteration error comes from the loader. If you changed the dataloader, please check it carefully first.

I don't recommend using pretrainAPN because it is not fully implemented. Just comment out the pretrainAPN part and try again.

Please don't reply to this message by e-mail. Instead, please use the GitHub issue tracker so that others can help as well.

Thanks

Jeongtae, Lee

-----Original Message-----
From: "deepblue0822" <notifications@github.com>
To: "jeong-tae/RACNN-pytorch" <RACNN-pytorch@noreply.github.com>
Cc: "jeong-tae" <make8286@naver.com>; "Comment" <comment@noreply.github.com>
Sent: 2018-06-29 (Fri) 12:32:33
Subject: Re: [jeong-tae/RACNN-pytorch] There is some error: I changed logits,_, _= net(images) into logits, cc, aa= net(images) because it used to raise an error, but now I get the same error (#3)

When I reduced the batch size to 2, the error became:

Traceback (most recent call last):
  File "trainer2.py", line 310, in <module>
    train()
  File "trainer2.py", line 61, in train
    apn_iter, apn_epoch, apn_steps = pretrainAPN(trainset, trainloader)
  File "trainer2.py", line 206, in pretrainAPN
    images, labels = next(batch_iterator)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 276, in __next__
    raise StopIteration
StopIteration
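For readers who hit the same StopIteration: it is raised when next() is called on an iterator that has already yielded every batch. One common way to guard against this is to restart the iterator when it runs out. The sketch below is only an illustration, not the code in trainer2.py: `next_batch` is a hypothetical helper, and a plain list stands in for `trainloader` so that the example runs without PyTorch.

```python
from typing import Any, Iterable, Iterator, Tuple


def next_batch(loader: Iterable, batch_iterator: Iterator) -> Tuple[Any, Iterator]:
    """Return the next batch, restarting the iterator when it is exhausted.

    Hypothetical helper for illustration; it is not part of RACNN-pytorch.
    """
    try:
        batch = next(batch_iterator)
    except StopIteration:
        # The loader has been fully consumed: start a new pass over it.
        batch_iterator = iter(loader)
        batch = next(batch_iterator)
    return batch, batch_iterator


# Stand-in for a DataLoader: any re-iterable of (images, labels) pairs.
trainloader = [("images0", "labels0"), ("images1", "labels1")]
batch_iterator = iter(trainloader)

seen = []
for step in range(5):  # deliberately more steps than there are batches
    (images, labels), batch_iterator = next_batch(trainloader, batch_iterator)
    seen.append(labels)
```

If the loader itself is empty, the second next() still raises StopIteration, so the dataset and loader setup should be checked first in that case.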

deepblue0822 commented 6 years ago

OK, thank you very much. I reduced the batch_size and it worked, but the accuracy is very low, and I think my memory may not be enough. I will try to do something to improve it. Anyway, thank you very much.

jeong-tae commented 6 years ago

I hope you get good results. Even in my experiments, I can't use a batch size larger than 4. If you find any errors or improvements, please let me know and share them.