Closed StrongerzZ closed 4 years ago
@StrongerzZ I've seen some articles saying it is not straightforward to use multiple GPUs when training SSD, and I myself have only been using a single GPU for SSD training. But let me give it a try. I will look into this.
Good to hear that! @ponta256
@StrongerzZ Could you try it by replacing fssd512_resnext.py with the following gist? https://gist.github.com/ponta256/c174a1820d20ffe5d9d59c59c327d45c
It seems to work for me. I tried it with three GPUs and trained for three epochs with no errors, and prediction using the trained weights gave me a reasonable result.
I think the cause of the problem is the same as https://github.com/pytorch/pytorch/issues/8637, and I've adjusted my code accordingly.
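For readers hitting the same error: the linked pytorch/pytorch#8637 is about submodules that `nn.Module` cannot see (e.g. layers kept in a plain Python list), so `DataParallel` never replicates them and replica 1 ends up calling weights that still live on device 0. The `Broken`/`Fixed` classes below are a minimal illustration I wrote for this comment, not code from the repo:

```python
import torch.nn as nn

class Broken(nn.Module):
    """Convs hidden in a plain Python list are NOT registered as submodules,
    so .cuda(), .parameters() and DataParallel replication all miss them."""
    def __init__(self):
        super().__init__()
        self.heads = [nn.Conv2d(3, 8, 3, padding=1) for _ in range(2)]

class Fixed(nn.Module):
    """nn.ModuleList registers each conv, so DataParallel can copy them
    onto every replica's device."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(3, 8, 3, padding=1) for _ in range(2))

    def forward(self, x):
        return [h(x) for h in self.heads]

# The registered parameters reveal the difference:
print(len(list(Broken().parameters())))  # 0 -- convs invisible to nn.Module
print(len(list(Fixed().parameters())))   # 4 -- weight + bias for each conv
```

With the `Broken` pattern, single-GPU training can still appear to work (the tensors happen to be on the right device), which is why the bug only surfaces under `DataParallel`.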
I don't have a multi-GPU machine available at the moment. I will verify your code later; if it works, I will close the issue, and if there are still problems, I will give you feedback. Anyway, thanks again! @ponta256
Looking at the SSDAugmentation function, I have some questions: 1) Why isn't the input image normalized to [0, 1] by dividing by 255? This gives the network large input values. 2) Why is only the mean subtracted, without dividing by the std to normalize the input data? Will this cause problems? Can you help me answer this? @ponta256
@StrongerzZ Thank you for pointing it out. Those are remnants of an old dirty hack of mine to integrate some augmentation features, if I recall correctly. I am sure it would work just fine if you 1) divide by 255, 2) subtract the mean, and 3) divide by the SD, as is usually done.
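The three steps above can be sketched as follows; the mean/std values here are the common ImageNet statistics, chosen only for illustration (the repo's actual constants may differ):

```python
import numpy as np

# Assumed per-channel statistics (ImageNet convention), not the repo's values.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8):
    """HWC uint8 image -> float32: scale to [0,1], subtract mean, divide by SD."""
    x = image_uint8.astype(np.float32) / 255.0  # 1) divide by 255
    x = x - MEAN                                # 2) subtract per-channel mean
    x = x / STD                                 # 3) divide by per-channel SD
    return x

img = np.full((4, 4, 3), 255, dtype=np.uint8)   # a pure-white test image
out = normalize(img)
# A white pixel maps to (1 - mean) / std in each channel.
print(np.allclose(out[0, 0], (1.0 - MEAN) / STD))  # True
```

Note that skipping step 1 while still subtracting a [0, 1]-scale mean (as the old hack effectively did) leaves inputs in the hundreds, which is why the question about large input values comes up.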
Also, I noticed the NMS included in this repo is very slow and can easily be sped up, say 10 times faster. I will try to find some time to clean up the code.
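For reference, the usual way to get that kind of speedup is to vectorize the IoU computation. The sketch below is a generic NumPy NMS I'm adding for illustration, not the repo's implementation: comparing the current top-scoring box against all remaining candidates at once replaces the inner pure-Python loop.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of box i against every remaining candidate, in one vectorized shot
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too much and is suppressed
```

In practice `torchvision.ops.nms` (a compiled CUDA/C++ kernel) is faster still, if adding the dependency is acceptable.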
@ponta256 Thank you for your answer and look forward to your NMS improvement. : )
I verified the code with multi-GPU parallelism; everything works fine. Thanks again! @ponta256
Traceback (most recent call last):
  File "E:/NNDL_pytorch/fssd-resnext-voc-coco-master/train_fssd_resnext.py", line 459, in <module>
    train()
  File "E:/NNDL_pytorch/fssd-resnext-voc-coco-master/train_fssd_resnext.py", line 240, in train
    out = net(images)
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\NNDL_pytorch\fssd-resnext-voc-coco-master\fssd512_resnext.py", line 293, in forward
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\conv.py", line 343, in forward
    return self.conv2d_forward(input, self.weight)
  File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\conv.py", line 340, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

Can you help me solve it? @ponta256