meijieru / crnn.pytorch

Convolutional recurrent network in pytorch
MIT License

When using multi-GPU training, get "all tensors must be on devices[0]" #34

Closed Lzc6996 closed 6 years ago

Lzc6996 commented 7 years ago

What should I do?

meijieru commented 7 years ago

Could you provide details of the problem? Thanks.

Lzc6996 commented 7 years ago
  batch_index = random_start + torch.range(0, self.batch_size - 1)
Traceback (most recent call last):
  File "crnn_main.py", line 220, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "crnn_main.py", line 195, in trainBatch
    preds = crnn(image)
  File "/gruntdata/DL_dataset/steven.lzc/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/gruntdata/DL_dataset/steven.lzc/workspace/crnn.pytorch.armor/models/crnn.py", line 86, in forward
    output = utils.data_parallel(self.rnn, conv, self.ngpu)
  File "/gruntdata/DL_dataset/steven.lzc/workspace/crnn.pytorch.armor/models/utils.py", line 10, in data_parallel
    output = nn.parallel.data_parallel(model, input, range(ngpu))
  File "/gruntdata/DL_dataset/steven.lzc/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 105, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs)
  File "/gruntdata/DL_dataset/steven.lzc/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 46, in parallel_apply
    raise output
RuntimeError: all tensors must be on devices[0]

I get this error when I use --ngpu 2.

YoungMiao commented 7 years ago

@Lzc6996 Have you solved this problem? Thanks.

XuLiangFRDC commented 7 years ago

I also ran into this problem.
My OS is CentOS Linux release 7.2.1511 (Core), x86_64. My server has four Tesla P40 GPU cards.

My command is: python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_test_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --ngpu 2 --keep_ratio --random_sample

And error message is:

Namespace(adadelta=False, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=2, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_test_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 7224
CRNN (
  (cnn): Sequential (
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu0): ReLU (inplace)
    (pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU (inplace)
    (pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu2): ReLU (inplace)
    (conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu3): ReLU (inplace)
    (pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu4): ReLU (inplace)
    (conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu5): ReLU (inplace)
    (pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu6): ReLU (inplace)
  )
  (rnn): Sequential (
    (0): BidirectionalLSTM (
      (rnn): LSTM(512, 256, bidirectional=True)
      (embedding): Linear (512 -> 256)
    )
    (1): BidirectionalLSTM (
      (rnn): LSTM(256, 256, bidirectional=True)
      (embedding): Linear (512 -> 37)
    )
  )
)
Traceback (most recent call last):
  File "crnn_main.py", line 197, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "crnn_main.py", line 180, in trainBatch
    preds = crnn(image)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuliang/CRNN_pytorch/crnn.pytorch/models/crnn.py", line 85, in forward
    output = utils.data_parallel(self.rnn, conv, self.ngpu)
  File "/home/xuliang/CRNN_pytorch/crnn.pytorch/models/utils.py", line 10, in data_parallel
    output = nn.parallel.data_parallel(model, input, range(ngpu))
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 105, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 46, in parallel_apply
    raise output
RuntimeError: all tensors must be on devices[0]

How can I solve this problem?

meijieru commented 7 years ago

@XuLiangFRDC Have you tried the updated version?

XuLiangFRDC commented 7 years ago

@meijieru I updated the code to the current version. Now a different problem appeared:
The command is: python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --ngpu 4 --adadelta --keep_ratio --random_sample

And error message is:

Namespace(adadelta=True, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=4, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 7716
CRNN (
  (cnn): Sequential (
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu0): ReLU (inplace)
    (pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU (inplace)
    (pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu2): ReLU (inplace)
    (conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu3): ReLU (inplace)
    (pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu4): ReLU (inplace)
    (conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu5): ReLU (inplace)
    (pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu6): ReLU (inplace)
  )
  (rnn): Sequential (
    (0): BidirectionalLSTM (
      (rnn): LSTM(512, 256, bidirectional=True)
      (embedding): Linear (512 -> 256)
    )
    (1): BidirectionalLSTM (
      (rnn): LSTM(256, 256, bidirectional=True)
      (embedding): Linear (512 -> 37)
    )
  )
)
[0/25][500/112885] Loss: 16.009935
Start val
Traceback (most recent call last):
  File "crnn_main.py", line 207, in <module>
    val(crnn, test_dataset, criterion)
  File "crnn_main.py", line 158, in val
    sim_preds = converter.decode(preds.data, preds_size.data, raw=False)
  File "/home/xuliang/CRNN_pytorch_v2/crnn.pytorch/utils.py", line 51, in decode
    t[index:index + l], torch.IntTensor([l]), raw=raw))
ValueError: result of slicing is an empty tensor

Thanks for your help.

Lzc6996 commented 7 years ago

I have solved it. In models/crnn.py, changing output = utils.data_parallel(self.rnn, conv, self.ngpu) to output = self.rnn(conv) makes it work.
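For readers following along, here is a minimal sketch of what that change looks like in a forward pass. It is not the repository's code: TinyCRNN, its layer sizes, and the data_parallel helper shown here are hypothetical stand-ins, and input.is_cuda is used as an assumed modern replacement for the older tensor-type check. The recurrent part is called directly instead of through data_parallel, which is the change described above.

```python
import torch
import torch.nn as nn


def data_parallel(model, input, ngpu):
    # Hypothetical helper: split the batch across GPUs only when the
    # input is on CUDA and more than one GPU was requested.
    if input.is_cuda and ngpu > 1:
        return nn.parallel.data_parallel(model, input, list(range(ngpu)))
    return model(input)


class TinyCRNN(nn.Module):
    """Hypothetical, much smaller stand-in for the CRNN in this thread."""

    def __init__(self, n_class=37, nh=16, ngpu=1):
        super().__init__()
        self.ngpu = ngpu
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d((1, None)),  # collapse height to 1, keep width
        )
        self.rnn = nn.LSTM(8, nh, bidirectional=True)
        self.embedding = nn.Linear(nh * 2, n_class)

    def forward(self, image):
        conv = data_parallel(self.cnn, image, self.ngpu)  # (b, c, 1, w)
        conv = conv.squeeze(2).permute(2, 0, 1)           # (w, b, c)
        # The change from the comment above: call the RNN directly
        # rather than wrapping it in data_parallel.
        recurrent, _ = self.rnn(conv)
        return self.embedding(recurrent)                  # (w, b, n_class)


model = TinyCRNN()
preds = model(torch.randn(4, 1, 32, 100))
print(preds.shape)
```

On a CPU-only machine the helper falls through to a plain call, so the sketch runs without any GPU; whether the convolutional part benefits from data_parallel on a real multi-GPU machine is not something this sketch demonstrates.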

meijieru commented 7 years ago

@XuLiangFRDC Please open another issue, as this does not seem to be the same problem.

XuLiangFRDC commented 7 years ago

@meijieru I have opened another issue. Please see #41.

Thanks!

XuLiangFRDC commented 7 years ago

@Lzc6996 Could you please give more details about your solution? You wrote: "I have solve it. change output = utils.data_parallel(self.rnn, conv, self.ngpu) to output = self.rnn(conv) makes it work."

I modified the file crnn.pytorch-master/models/crnn.py as follows. I added: import utils

and replaced the code output = self.rnn(conv) with output = utils.data_parallel(self.rnn, conv, self.ngpu)

I also added a new file utils.py to the directory crnn.pytorch-master/models. The new file utils.py is defined as follows:

#!/usr/bin/python
# encoding: utf-8

import torch
import torch.nn as nn
import torch.nn.parallel


def data_parallel(model, input, ngpu):
    # Use several GPUs only for CUDA input and ngpu > 1; otherwise call the model directly.
    if isinstance(input.data, torch.cuda.FloatTensor) and ngpu > 1:
        output = nn.parallel.data_parallel(model, input, range(ngpu))
    else:
        output = model(input)
    return output

But it doesn't work. Why? The command is: python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --adadelta --keep_ratio --random_sample

The following error message is:

Namespace(adadelta=True, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=1, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 5088
CRNN (
  (cnn): Sequential (
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu0): ReLU (inplace)
    (pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU (inplace)
    (pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu2): ReLU (inplace)
    (conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu3): ReLU (inplace)
    (pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu4): ReLU (inplace)
    (conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu5): ReLU (inplace)
    (pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu6): ReLU (inplace)
  )
  (rnn): Sequential (
    (0): BidirectionalLSTM (
      (rnn): LSTM(512, 256, bidirectional=True)
      (embedding): Linear (512 -> 256)
    )
    (1): BidirectionalLSTM (
      (rnn): LSTM(256, 256, bidirectional=True)
      (embedding): Linear (512 -> 37)
      (rnn): LSTM(256, 256, bidirectional=True)
      (embedding): Linear (512 -> 37)
    )
  )
)

Traceback (most recent call last):
  File "crnn_main.py", line 197, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "crnn_main.py", line 180, in trainBatch
    preds = crnn(image)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuliang/CRNN_pytorch_v2/crnn.pytorch/models/crnn.py", line 83, in forward
    output = utils.data_parallel(self.rnn, conv, self.ngpu)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 238, in __getattr__
    type(self).__name__, name))
AttributeError: 'CRNN' object has no attribute 'ngpu'

By the way, I changed the code to include the attribute 'ngpu', and this problem is solved. But with multi-GPU training there is still a problem. Please see #41.
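For anyone else who hits the same AttributeError, a small sketch of that kind of fix follows. The class names WithoutNgpu and WithNgpu and the Linear layer are hypothetical and only illustrate the pattern; the assumption is that forward reads self.ngpu, so the constructor has to assign it first.

```python
import torch
import torch.nn as nn


class WithoutNgpu(nn.Module):
    """Hypothetical module that reproduces the AttributeError."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        # self.ngpu was never assigned, so attribute lookup on the
        # module fails with AttributeError.
        if x.is_cuda and self.ngpu > 1:
            return nn.parallel.data_parallel(self.fc, x, list(range(self.ngpu)))
        return self.fc(x)


class WithNgpu(nn.Module):
    """Same module, but the constructor stores ngpu as an attribute."""

    def __init__(self, ngpu=1):
        super().__init__()
        self.ngpu = ngpu  # must be set before forward() reads it
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        if x.is_cuda and self.ngpu > 1:
            return nn.parallel.data_parallel(self.fc, x, list(range(self.ngpu)))
        return self.fc(x)


x = torch.randn(3, 4)
fixed_out = WithNgpu(ngpu=1)(x)
print(fixed_out.shape)
```

Note that WithoutNgpu only fails once forward actually evaluates self.ngpu; with CPU input the short-circuiting `and` skips that lookup, which is why such a bug can go unnoticed until training runs with --cuda.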