jeong-tae / RACNN-pytorch

This is a third party implementation of RA-CNN in pytorch.

One error in your code and a CUDA out-of-memory problem #7

Closed gaoyuchris closed 5 years ago

gaoyuchris commented 6 years ago

Hello, in line 228 of ./trainer.py,

response_map = F.upsample(response_map, size = [resize, resize])

should maybe be

before_upsample = Variable(response_map.unsqueeze(0))
response_map = F.upsample(before_upsample, size = [resize, resize])
response_map = response_map.data.squeeze()
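A minimal sketch of the reshaping idea above, assuming `response_map` is a 2-D (H, W) tensor and `resize` is the target side length. Both are assumptions, and the actual shapes in trainer.py may differ. `F.interpolate` is the function that newer PyTorch releases recommend instead of the deprecated `F.upsample`, and for 2-D maps it expects a batched (N, C, H, W) input. The helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def upsample_response_map(response_map, resize):
    """Resize a 2-D (H, W) map to (resize, resize).

    Hypothetical helper, not part of trainer.py.
    """
    # Add batch and channel dims: (H, W) -> (1, 1, H, W).
    batched = response_map.unsqueeze(0).unsqueeze(0)
    # Default interpolation mode is nearest-neighbour.
    resized = F.interpolate(batched, size=[resize, resize])
    # Drop the two singleton dims again: (1, 1, R, R) -> (R, R).
    return resized.squeeze(0).squeeze(0)

response_map = torch.rand(14, 14)  # made-up input size
out = upsample_response_map(response_map, resize=448)
print(tuple(out.shape))
```

If the map already carries a channel dimension, a single `unsqueeze(0)` as in the suggestion above would be enough.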

Also, I have a question. The code runs without problems on a single GPU, but I get a "cuda error: out of memory" when I run it on multiple GPUs. Do you have the same problem, or do you know the reason?

jeong-tae commented 6 years ago

Oh, that's weird. On my machine there is no problem with multiple GPUs. Can you tell me more about your env?

I have a day job and am a little busy. Implementing code is more of a hobby, so my replies can be late. Sorry for the late reply.

gaoyuchris commented 6 years ago

Oh, thanks. My env is 8 Nvidia Titan X Pascal GPUs with 12GB memory. I wonder whether the model takes up too much memory, especially the VGG structure? When I replace VGG with a simpler structure, it works well with multiple GPUs. The model with VGG trained on a single GPU is 2.5 GB. I also find that with multiple GPUs a new model is created when forwarding each image, while with a single GPU only one model is created for all images.

jeong-tae commented 6 years ago

As far as I know, VGG uses a lot more memory than other networks. Reduce the parameters of the last (fc) layers and try a smaller batch size. In my case the batch size was about 4 for 1 GPU. It was very small.
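To illustrate why shrinking the fc layers helps, here is a rough parameter count. It assumes the standard VGG-16 classifier head (25088 -> 4096 -> 4096 -> 1000) and float32 weights. The repository's actual backbone and head sizes may differ, so treat this as an order-of-magnitude sketch only.

```python
# Fully connected layers of a standard VGG-16 classifier (assumed sizes,
# not read from the repository).
FC_LAYERS = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
BYTES_PER_FLOAT32 = 4

def fc_param_count(layers):
    """Count weights plus biases for a stack of dense layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)

fc_params = fc_param_count(FC_LAYERS)
fc_megabytes = fc_params * BYTES_PER_FLOAT32 / 1024 ** 2

print(f"fc parameters: {fc_params:,}")
print(f"fc weights alone: {fc_megabytes:.0f} MiB per copy of the network")
```

During training, gradients add another copy of the same size, which is why the fc head is a natural first place to cut.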

I didn't know it creates a new model. Hm... Did you solve this problem?

Nine9Nine commented 6 years ago

I also get the same CUDA out-of-memory error. I reduced the batch size to 1, but it still does not work. My GPU is a 1080 Ti with 11 GB.

hubutui commented 6 years ago

@jeong-tae Could you provide more detail about your env? My env:

python 3.5.2
pytorch 0.4.1
torchvision 0.2.1
numpy 15.1
tensorflow 1.10.1
2 Titan Xp GPUs with 12 GB memory

I got this error:

Traceback (most recent call last):
  File "trainer.py", line 17, in <module>
    from data import CUB200_loader
ImportError: cannot import name 'CUB200_loader'

So I added data/__init__.py:

from .CUB_loader import CUB200_loader

and then I got:

 [*] Set cuda: True
 [*] Loading dataset...
Traceback (most recent call last):
  File "trainer.py", line 306, in <module>
    train()
  File "trainer.py", line 64, in train
    apn_iter, apn_epoch, apn_steps = pretrainAPN(trainset, trainloader)
  File "trainer.py", line 214, in pretrainAPN
    _, conv5s, attens = net(images)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate
    return replicate(module, device_ids)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 13, in forward
    raise TypeError('Broadcast function not implemented for CPU tensors')
TypeError: Broadcast function not implemented for CPU tensors
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f0f49e1a400>>
Traceback (most recent call last):
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

If I run with only one GPU, env CUDA_VISIBLE_DEVICES=1 python trainer.py:

 [*] Set cuda: True
 [*] Loading dataset...
Traceback (most recent call last):
  File "trainer.py", line 306, in <module>
    train()
  File "trainer.py", line 64, in train
    apn_iter, apn_epoch, apn_steps = pretrainAPN(trainset, trainloader)
  File "trainer.py", line 214, in pretrainAPN
    _, conv5s, attens = net(images)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 48, in forward
    scaledA_x = self.crop_resize(x, atten1 * 448)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 151, in forward
    return AttentionCropFunction.apply(images, locs)
  File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 98, in forward
    mk = (h(x-w_off) - h(x-w_end)) * (h(y-h_off) - h(y-h_end))
  File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 75, in <lambda>
    h = lambda x: 1 / (1 + torch.exp(-10 * x))
RuntimeError: _exp_out is not implemented for type torch.cuda.LongTensor
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f4838c70400>>
Traceback (most recent call last):
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

Any suggestion?

jeong-tae commented 6 years ago

@hubutui Sorry for the late reply. I think this issue is the same as https://github.com/jeong-tae/RACNN-pytorch/issues/9. Please refer to it, and if the conversion doesn't work, let me know by re-opening the issue. I am planning to fix some issues at the end of September. There are some holidays then when I can work on this repository.
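A sketch of the kind of conversion referred to here, assuming the `_exp_out is not implemented for type torch.cuda.LongTensor` error comes from integer coordinate tensors reaching `torch.exp`. The exact fix from issue #9 is not shown in this thread. Casting to float inside the sigmoid-like `h` avoids the integer path. The coordinates and crop boundaries below are made-up values.

```python
import torch

def h(x):
    """Steep logistic step, as in the lambda from the traceback."""
    # Cast first: the traceback shows that PyTorch 0.4.1 has no exp
    # implementation for LongTensor inputs.
    return 1 / (1 + torch.exp(-10 * x.float()))

# Integer pixel coordinates; torch.arange with int arguments gives int64.
x = torch.arange(0, 8)
w_off, w_end = 2, 5  # hypothetical crop boundaries

# Soft 1-D box: close to 1 strictly inside (w_off, w_end), close to 0 outside.
mask = h(x - w_off) - h(x - w_end)
print(mask.dtype)
print(mask)
```

The 2-D mask in models/RACNN.py multiplies two such 1-D boxes, one along x and one along y.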

hubutui commented 6 years ago

@jeong-tae Yeah, sorry, I didn't check the closed issues. I could fix this now, but it only runs on one GPU, still not with 2 GPUs. And the rank loss doesn't decrease either.

songwaimai commented 5 years ago

@Nine9Nine I have the same problem. Have you resolved it?

songwaimai commented 5 years ago

My GPU is also a 1080 Ti with 11 GB, and I have reduced the batch size to 1, but it still fails:

/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/nn/functional.py:1890: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
 [*] pre_apn_epoch[0], || pre_apn_iter 0 || pre_apn_loss: 0.0493 || Timer: 3.8996sec
 [*] Swtich optimize parameters to Class
Traceback (most recent call last):
  File "/home/dl2/Songly/RACNN-pytorch-master/trainer.py", line 319, in <module>
    train()
  File "/home/dl2/Songly/RACNN-pytorch-master/trainer.py", line 135, in train
    logger.scalar_summary('rank_loss', new_apn_loss.item, iteration + 1)
  File "/home/dl2/Songly/RACNN-pytorch-master/visual/logger.py", line 16, in scalar_summary
    summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
TypeError: <built-in method item of Tensor object at 0x7f056977e3f0> has type builtin_function_or_method, but expected one of: int, long, float
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f05698c67f0>>
Traceback (most recent call last):
  File "/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
    return ForkingPickler.loads(res)
  File "/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

Process finished with exit code 1

I don't know whether my GPU can't support it or whether it is some other problem.

jeong-tae commented 5 years ago

It doesn't seem to be a CUDA memory problem. Did you install tensorboard? If your problem isn't solved by installing tensorboard, please open a new issue.
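For what it is worth, the `TypeError` in the traceback above reports that a `builtin_function_or_method` was passed as the scalar value, which is what happens when `new_apn_loss.item` is logged without being called. A minimal sketch of the difference, using a stand-in tensor rather than the real loss:

```python
import torch

new_apn_loss = torch.tensor(0.0493)  # stand-in for the real rank loss

not_called = new_apn_loss.item   # bound method object, not a number
scalar = new_apn_loss.item()     # plain Python float

print(callable(not_called), type(scalar).__name__)
```

Under that reading, line 135 of trainer.py would need to pass `new_apn_loss.item()` to `logger.scalar_summary`.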

songwaimai commented 5 years ago

@jeong-tae Thank you, the problem has been resolved. The program is running well.

LXYTSOS commented 5 years ago

I encountered the same problem. Have you solved it?

jeong-tae commented 5 years ago

@LXYTSOS I am sorry about the memory issue. I think it is hard for me to help you with this part. The model uses multiple VGG layers, so a lot of memory is needed. I recommend a GPU with at least 12GB of memory. One thing I could try is to use thinner layers, but I am not sure that would be an improvement. The goal of this repository is to reproduce the original paper in PyTorch code, so this is a little out of scope for me. This thread seems to be going off topic, so I will close this issue. If you have any other issue, please open a new one for it.

doublemanyu commented 5 years ago

Hey @songwaimai, I ran into the same problem as you. How did you solve it? I would appreciate it very much if you could share your solution.