ponta256 / fssd-resnext-voc-coco


An error in Multi-GPU parallel training #2

Closed StrongerzZ closed 4 years ago

StrongerzZ commented 4 years ago

    Traceback (most recent call last):
      File "E:/NNDL_pytorch/fssd-resnext-voc-coco-master/train_fssd_resnext.py", line 459, in <module>
        train()
      File "E:/NNDL_pytorch/fssd-resnext-voc-coco-master/train_fssd_resnext.py", line 240, in train
        out = net(images)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\data_parallel.py", line 152, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\data_parallel.py", line 162, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\_utils.py", line 369, in reraise
        raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in replica 1 on device 1.
    Original Traceback (most recent call last):
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "E:\NNDL_pytorch\fssd-resnext-voc-coco-master\fssd512_resnext.py", line 293, in forward
        x = l(x)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\conv.py", line 343, in forward
        return self.conv2d_forward(input, self.weight)
      File "D:\Anaconda3\envs\pytorch_envs\lib\site-packages\torch\nn\modules\conv.py", line 340, in conv2d_forward
        self.padding, self.dilation, self.groups)
    RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

Can you help me solve it? @ponta256

ponta256 commented 4 years ago

@StrongerzZ I've seen some articles saying it is not straightforward to use multiple GPUs when training SSD, and I myself have only been using a single GPU for SSD training. But let me give it a try; I will look into this.

StrongerzZ commented 4 years ago

Good to hear that! @ponta256

ponta256 commented 4 years ago

@StrongerzZ Could you try it by replacing fssd512_resnext.py with the following gist? https://gist.github.com/ponta256/c174a1820d20ffe5d9d59c59c327d45c

It seems to work for me. I tried it with three GPUs and trained for three epochs with no errors, and prediction using the trained weights gave me a reasonable result.

I think the cause of the problem is the same as https://github.com/pytorch/pytorch/issues/8637, and I've adjusted my code accordingly.
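
For anyone hitting the same error: a common cause of this exact failure (and, as far as I can tell, what the linked issue describes) is submodules kept in a plain Python list, which nn.Module doesn't register, so DataParallel.replicate() never copies them to the other devices. A minimal illustrative sketch of the problem and the fix, not the exact code from this repo:

```python
import torch
import torch.nn as nn

class BrokenHead(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list is invisible to nn.Module: these convs are
        # not registered as submodules, so DataParallel.replicate() skips
        # them and every replica shares the weights left on cuda:0,
        # producing the "device 1 does not equal 0" error above.
        self.layers = [nn.Conv2d(16, 16, 3, padding=1).cuda(0) for _ in range(3)]

    def forward(self, x):
        for l in self.layers:
            x = l(x)
        return x

class FixedHead(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each conv, so replication works.
        self.layers = nn.ModuleList(
            nn.Conv2d(16, 16, 3, padding=1) for _ in range(3))

    def forward(self, x):
        for l in self.layers:
            x = l(x)
        return x

if torch.cuda.device_count() > 1:
    x = torch.randn(8, 16, 32, 32).cuda()
    print(nn.DataParallel(FixedHead().cuda())(x).shape)  # runs fine
    # nn.DataParallel(BrokenHead().cuda())(x)  # raises the RuntimeError above
```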

StrongerzZ commented 4 years ago

I don't have a multi-GPU machine available at the moment, so I will verify your code later. If it works, I will close the issue; if there are still problems, I will give you feedback. Anyway, thanks again! @ponta256

StrongerzZ commented 4 years ago

Looking at the SSDAugmentation function, I have some questions: 1) Why is the input image not divided by 255 to normalize it to [0, 1]? This gives the network very large input values. 2) Why is only the mean subtracted, without dividing by the std to normalize the input data? Will this cause problems? Can you help me answer these? @ponta256

ponta256 commented 4 years ago

@StrongerzZ Thank you for pointing that out. Those are remnants of an old dirty hack of mine to integrate some augmentation features, if I recall correctly. I am sure it will work just fine if you 1) divide by 255, 2) subtract the mean, and 3) divide by the standard deviation, as is usually done.
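
For concreteness, that standard preprocessing would look something like this (a minimal sketch; the ImageNet mean/std values below are the usual choice, not necessarily what this repo's backbone expects):

```python
import numpy as np

# ImageNet statistics -- an assumption here; use whatever the
# pretrained backbone was actually trained with.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8):
    x = image_uint8.astype(np.float32) / 255.0  # 1) scale to [0, 1]
    x -= MEAN                                   # 2) subtract the mean
    x /= STD                                    # 3) divide by the std
    return x
```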

Also, I noticed the NMS included in this repo is very slow and could easily be sped up, say 10 times. I will try to find some time to clean up the code.
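
One common route to that kind of speedup (whether the planned cleanup takes this route is an assumption) is to replace a pure-Python NMS loop with torchvision's built-in kernel, for example:

```python
from torchvision.ops import nms  # batched C++/CUDA implementation

def fast_nms(boxes, scores, iou_threshold=0.45, top_k=200):
    # boxes: (N, 4) tensor in (x1, y1, x2, y2) format; scores: (N,) tensor.
    # nms() returns the indices of the kept boxes, ordered by descending score.
    keep = nms(boxes, scores, iou_threshold)
    return keep[:top_k]
```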

StrongerzZ commented 4 years ago

@ponta256 Thank you for your answer, and I look forward to your NMS improvement. :)

StrongerzZ commented 4 years ago

I verified the code with multi-GPU parallel training and everything works fine. Thanks again! @ponta256