Closed gaoyuchris closed 5 years ago
Oh, that's weird. On my machine there is no problem with multiple GPUs. Can you tell me more about your environment?
I have a day job and am a little busy. Working on this code is more of a hobby, so my replies can be late. Sorry for the late reply.
Oh, thanks. My environment is 8 NVIDIA Titan X Pascal GPUs with 12 GB of memory each. I wonder whether the model takes up too much memory, especially the VGG structure. When I replace VGG with a simpler structure, it works well with multiple GPUs. The VGG-based model trained on a single GPU is 2.5 GB. I also found that with multiple GPUs it creates a new model for forwarding each image, while with a single GPU it creates only one model for all images.
As far as I know, VGG uses a lot more memory than other networks. Reduce the parameters of the last (fc) layers and try a smaller batch size. In my case the batch size was about 4 for 1 GPU, which is very small.
I didn't know it creates a new model. Hmm... Did you solve this problem?
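To put a rough number on the suggestion to shrink the fc layers, here is a small back-of-the-envelope sketch. It assumes the standard VGG classifier head with a 224x224 input (conv5 output of 512x7x7) and 1000 output classes; the head actually used in this repository may differ, so treat the figures as an illustration only.

```python
# Rough parameter count for the three fully connected layers of a standard
# VGG classifier head (assumption: 224x224 input, so conv5 output is 512x7x7).
FC_LAYERS = [
    (512 * 7 * 7, 4096),  # fc6
    (4096, 4096),         # fc7
    (4096, 1000),         # fc8 (assumed 1000 classes)
]


def count_params(layers):
    """Weights plus biases for a list of (in_features, out_features) layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)


fc_params = count_params(FC_LAYERS)
fc_mib = fc_params * 4 / 2**20  # float32 uses 4 bytes per parameter

print(f"fc parameters: {fc_params:,}")
print(f"fc weights in float32: {fc_mib:.0f} MiB")
```

This counts the weights only. Gradients typically add another copy of the same size, and activations grow with the batch size and input resolution, so real usage is higher.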
I also get the same CUDA out-of-memory error. I reduced the batch size to 1, but it still doesn't work. My GPU is a 1080 Ti with 11 GB.
@jeong-tae Could you provide more detail about your env? My env:
python 3.5.2
pytorch 0.4.1
torchvision 0.2.1
numpy 1.15.1
tensorflow 1.10.1
2 Titan Xp GPU with 12g mem
I got this error:
Traceback (most recent call last):
File "trainer.py", line 17, in <module>
from data import CUB200_loader
ImportError: cannot import name 'CUB200_loader'
So I added data/__init__.py:
from .CUB_loader import CUB200_loader
and then I got:
[*] Set cuda: True
[*] Loading dataset...
Traceback (most recent call last):
File "trainer.py", line 306, in <module>
train()
File "trainer.py", line 64, in train
apn_iter, apn_epoch, apn_steps = pretrainAPN(trainset, trainloader)
File "trainer.py", line 214, in pretrainAPN
_, conv5s, attens = net(images)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate
return replicate(module, device_ids)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
param_copies = Broadcast.apply(devices, *params)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/_functions.py", line 13, in forward
raise TypeError('Broadcast function not implemented for CPU tensors')
TypeError: Broadcast function not implemented for CPU tensors
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f0f49e1a400>>
Traceback (most recent call last):
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
c = SocketClient(address)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
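A note on the "Broadcast function not implemented for CPU tensors" error above: it is typically raised when nn.DataParallel wraps a module whose parameters are still on the CPU. Whether that is the cause here is an assumption, since the model construction code in trainer.py is not shown in this thread. A minimal sketch of the usual ordering, with a small hypothetical network standing in for the RACNN model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the repository's RACNN model.
net = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = net.to(device)  # move the parameters to the GPU first

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module onto every GPU on each forward
    # pass, so its parameters must already live on a CUDA device.
    net = nn.DataParallel(net)

images = torch.randn(4, 3, 32, 32, device=device)
out = net(images)
print(out.shape)
```

The sketch falls back to the CPU when no GPU is visible, so it also runs on a machine without CUDA.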
If I run with only one GPU (env CUDA_VISIBLE_DEVICES=1 python trainer.py):
[*] Set cuda: True
[*] Loading dataset...
Traceback (most recent call last):
File "trainer.py", line 306, in <module>
train()
File "trainer.py", line 64, in train
apn_iter, apn_epoch, apn_steps = pretrainAPN(trainset, trainloader)
File "trainer.py", line 214, in pretrainAPN
_, conv5s, attens = net(images)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 48, in forward
scaledA_x = self.crop_resize(x, atten1 * 448)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 151, in forward
return AttentionCropFunction.apply(images, locs)
File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 98, in forward
mk = (h(x-w_off) - h(x-w_end)) * (h(y-h_off) - h(y-h_end))
File "~/Projects/Look Closer to See Better/RACNN-pytorch/models/RACNN.py", line 75, in <lambda>
h = lambda x: 1 / (1 + torch.exp(-10 * x))
RuntimeError: _exp_out is not implemented for type torch.cuda.LongTensor
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f4838c70400>>
Traceback (most recent call last):
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
File "~/Projects/virtualenv/pytorch-0.4.1/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
c = SocketClient(address)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
Any suggestions?
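For the single-GPU error above ("_exp_out is not implemented for type torch.cuda.LongTensor"): the message indicates that torch.exp received an integer tensor. One way to avoid it is to cast to float before the exponential. The sketch below uses hypothetical coordinates and crop bounds, not the repository's actual values, and is not necessarily the fix applied in the repository. Newer PyTorch releases may promote integer inputs automatically, but the explicit cast is harmless there.

```python
import torch


def h(x):
    # Logistic function used as a soft step. Casting to float ensures
    # torch.exp never receives an integer (Long) tensor.
    return 1 / (1 + torch.exp(-10 * x.float()))


# Hypothetical integer pixel coordinates and crop bounds.
x = torch.arange(0, 8, dtype=torch.long)
w_off, w_end = 2, 6

# Soft mask that is close to 1 between w_off and w_end, close to 0 outside.
mask = h(x - w_off) - h(x - w_end)
print(mask.dtype)
```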
@hubutui Sorry for the late reply. I think this is the same issue as https://github.com/jeong-tae/RACNN-pytorch/issues/9. Please refer to it, and re-open this issue if the conversion doesn't work. I am planning to fix some issues at the end of September; there are some holidays then when I can work on this repository.
@jeong-tae Yeah, sorry, I didn't check the closed issues. I could fix this now, but it only runs with one GPU, still not with 2 GPUs. And the rank loss doesn't decrease either.
I have the same problem. Have you resolved it?
My GPU is also a 1080 Ti with 11 GB. I reduced the batch size to 1, but it still fails:
/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/nn/functional.py:1890: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
[*] pre_apn_epoch[0], || pre_apn_iter 0 || pre_apn_loss: 0.0493 || Timer: 3.8996sec
[*] Swtich optimize parameters to Class
Traceback (most recent call last):
File "/home/dl2/Songly/RACNN-pytorch-master/trainer.py", line 319, in <module>
train()
File "/home/dl2/Songly/RACNN-pytorch-master/trainer.py", line 135, in train
logger.scalar_summary('rank_loss', new_apn_loss.item, iteration + 1)
File "/home/dl2/Songly/RACNN-pytorch-master/visual/logger.py", line 16, in scalar_summary
summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
TypeError: <built-in method item of Tensor object at 0x7f056977e3f0> has type builtin_function_or_method, but expected one of: int, long, float
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f05698c67f0>>
Traceback (most recent call last):
File "/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
self._shutdown_workers()
File "/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
self.worker_result_queue.get()
File "/usr/lib/python3.5/multiprocessing/queues.py", line 345, in get
return ForkingPickler.loads(res)
File "/home/dl2/Songly/RACNN-pytorch-master/venv/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.5/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
c = SocketClient(address)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
Process finished with exit code 1
I don't know whether my GPU can't support it or whether it is some other problem.
It doesn't seem to be a CUDA memory problem. Did you install tensorboard? If installing tensorboard doesn't solve your problem, please open a new issue.
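A side note on the TypeError in the traceback above: the message says scalar_summary received a "built-in method item of Tensor object", which matches the call on line 135 passing new_apn_loss.item without parentheses. A minimal sketch of the difference, with a stand-in tensor in place of the real loss:

```python
import torch

new_apn_loss = torch.tensor(0.0493)  # stand-in for the real loss tensor

wrong = new_apn_loss.item    # bound method object, not a number
right = new_apn_loss.item()  # plain Python float

print(type(wrong).__name__)
print(type(right).__name__)
```

Passing `new_apn_loss.item()` gives the logger a plain float, which is one of the types the error message says it expects. Whether other call sites in trainer.py have the same slip is not visible from this thread.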
@jeong-tae Thank you, the problem has been resolved. The program is running well.
I encountered the same problem, have you solved it?
@LXYTSOS I am sorry about the memory issue. I think this part is hard for me to help with. The model uses multiple VGG layers, so a lot of memory is needed. I recommend a GPU with at least 12 GB of memory. One thing I could try is using thinner layers, but I am not sure that would be an improvement. The goal of this repository is to reproduce the original paper in PyTorch, so that is a little out of scope for me. This thread seems to be going off topic, so I will close this issue. If you have any other issue, please open a new one.
Hey, I ran into the same problem as you. How did you solve it? I would appreciate it very much if you could share your solution.
Hello, Line 228 of ./trainer.py:
response_map = F.upsample(response_map, size = [resize, resize])
maybe should be:
before_upsample = Variable(response_map.unsqueeze(0))
response_map = F.upsample(before_upsample, size = [resize, resize])
response_map = response_map.data.squeeze()
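The same idea can be sketched with F.interpolate, which the deprecation warning earlier in this thread recommends over F.upsample. The (C, H, W) layout and the sizes below are assumptions for illustration; the actual shape of response_map in trainer.py is not shown here. Since PyTorch 0.4 the Variable wrapper is no longer needed.

```python
import torch
import torch.nn.functional as F

resize = 8                          # hypothetical target size
response_map = torch.rand(3, 4, 4)  # assumed (C, H, W) layout

# F.interpolate expects a batch dimension: (N, C, H, W) for a 2-D size.
batched = response_map.unsqueeze(0)
upsampled = F.interpolate(batched, size=(resize, resize))
response_map = upsampled.squeeze(0)  # back to (C, H, W)

print(response_map.shape)
```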
Also, I have a question for you. There is no problem when I run your code with only one GPU, but I get the "cuda error: out of memory" problem when I run it with multiple GPUs. Do you have the same problem, or do you know the reason?