meijieru / crnn.pytorch

Convolutional recurrent network in pytorch

When using multi-GPU training, getting "ValueError: result of slicing is an empty tensor" #41

Open · XuLiangFRDC opened this issue 7 years ago

XuLiangFRDC commented 7 years ago

My OS is CentOS Linux release 7.2.1511 (Core), x86_64. My server has four Tesla P40 GPU cards.

The command is:

```
python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --ngpu 4 --adadelta --keep_ratio --random_sample
```

And the error message is:

```
Namespace(adadelta=True, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=4, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 7716
CRNN (
  (cnn): Sequential (
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu0): ReLU (inplace)
    (pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU (inplace)
    (pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu2): ReLU (inplace)
    (conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu3): ReLU (inplace)
    (pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu4): ReLU (inplace)
    (conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu5): ReLU (inplace)
    (pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu6): ReLU (inplace)
  )
  (rnn): Sequential (
    (0): BidirectionalLSTM (
      (rnn): LSTM(512, 256, bidirectional=True)
      (embedding): Linear (512 -> 256)
    )
    (1): BidirectionalLSTM (
      (rnn): LSTM(256, 256, bidirectional=True)
      (embedding): Linear (512 -> 37)
    )
  )
)
[0/25][500/112885] Loss: 16.009935
Start val
Traceback (most recent call last):
  File "crnn_main.py", line 207, in <module>
    val(crnn, test_dataset, criterion)
  File "crnn_main.py", line 158, in val
    sim_preds = converter.decode(preds.data, preds_size.data, raw=False)
  File "/home/xuliang/CRNN_pytorch_v2/crnn.pytorch/utils.py", line 51, in decode
    t[index:index + l], torch.IntTensor([l]), raw=raw))
ValueError: result of slicing is an empty tensor

```

Thanks for your help.
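For readers hitting the same thing: the traceback points at the batched branch of `strLabelConverter.decode` in utils.py. A simplified sketch of the failing pattern (following the traceback above; the real method differs in details):

```python
# Simplified sketch of the batched branch of strLabelConverter.decode.
texts = []
index = 0
for i in range(length.numel()):
    l = length[i]
    # When the model is wrapped in nn.DataParallel, its (T, b, nclass)
    # output is gathered along dim 0 -- the time axis, not the batch
    # axis -- so t and length disagree and this slice can come back empty:
    texts.append(self.decode(t[index:index + l], torch.IntTensor([l]), raw=raw))
    index += l
```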

XuLiangFRDC commented 7 years ago

@Lzc6996 gives this solution: "I have solved it. Changing `output = utils.data_parallel(self.rnn, conv, self.ngpu)` to `output = self.rnn(conv)` makes it work."

My comments: I modified crnn.py accordingly, replacing `output = self.rnn(conv)` with `output = utils.data_parallel(self.rnn, conv, self.ngpu)`.

But it doesn't work. @meijieru, would you please take time to check this problem? Thanks.

Lzc6996 commented 7 years ago

@XuLiangFRDC Our versions of the code are different. My code uses `data_parallel` for multi-GPU, but the current version uses `DataParallel`. I tried `DataParallel` to solve my problem but got the same error as you, so I went back to `data_parallel`, and my solution works for my version of the code. By the way, once you solve this problem you will hit another one, caused by BN (batch normalization).
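For context, the `data_parallel` helper in the older utils.py looks roughly like this (a sketch of that version; details may differ from your checkout):

```python
import torch
import torch.nn as nn

def data_parallel(model, input, ngpu):
    # Replicate the module across GPUs only when the input is already on
    # the GPU and more than one device was requested; otherwise run as-is.
    if isinstance(input.data, torch.cuda.FloatTensor) and ngpu > 1:
        output = nn.parallel.data_parallel(model, input, range(ngpu))
    else:
        output = model(input)
    return output
```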

XuLiangFRDC commented 7 years ago

@Lzc6996 I changed the code to the old version and modified crnn.py in crnn.pytorch-master/models as you suggested, replacing `output = utils.data_parallel(self.rnn, conv, self.ngpu)` with `output = self.rnn(conv)`.

Now multi-GPU training works. So far it is OK. Thanks a lot.
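For anyone making the same edit, the forward of the older models/crnn.py ends up looking roughly like this (a sketch; only the marked line changes, and the surrounding code may differ from your checkout):

```python
def forward(self, input):
    # conv features: the CNN is still split across GPUs
    conv = utils.data_parallel(self.cnn, input, self.ngpu)
    b, c, h, w = conv.size()
    assert h == 1, "the height of conv must be 1"
    conv = conv.squeeze(2)        # (b, c, w)
    conv = conv.permute(2, 0, 1)  # (w, b, c): time-major input for the RNN
    # changed line: run the RNN on a single GPU, so its (T, b, nclass)
    # output is never gathered along the time axis
    # output = utils.data_parallel(self.rnn, conv, self.ngpu)
    output = self.rnn(conv)
    return output
```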

Lzc6996 commented 7 years ago

@XuLiangFRDC But you won't get the same accuracy as with a single GPU. When you use two GPUs, each GPU computes its own BN statistics, so you may need to double your batchSize or compute BN on a single GPU. In fact, if your code works well, you need do nothing.
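In other words, with per-GPU BN each replica normalizes over only batchSize/ngpu images. A hedged example of compensating via the global batch (flags as in the Namespace dump above; paths are placeholders):

```
# 4 GPUs: 256 / 4 = 64 images per GPU, matching the single-GPU BN batch
python crnn_main.py --trainroot <trainroot> --valroot <valroot> --cuda --ngpu 4 --batchSize 256 --adadelta --keep_ratio --random_sample
```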

haikuoyao commented 7 years ago

I'm in the same boat as you guys, @XuLiangFRDC @Lzc6996. I'm using the current code. Is there any way to solve this? Or would it be possible to share your code? Thanks a lot.
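One workaround that should fit the current `DataParallel`-based code (a hypothetical sketch, not from this repo: the `BatchFirstCRNN` wrapper and its name are mine) is to make the model's output batch-first, so `DataParallel` gathers replica outputs along the batch axis rather than concatenating along the time axis:

```python
import torch.nn as nn

class BatchFirstCRNN(nn.Module):
    """Wrap CRNN so its (T, b, nclass) output becomes (b, T, nclass);
    nn.DataParallel then gathers replica outputs along the batch axis."""
    def __init__(self, crnn):
        super(BatchFirstCRNN, self).__init__()
        self.crnn = crnn

    def forward(self, input):
        output = self.crnn(input)       # (T, b, nclass)
        return output.permute(1, 0, 2)  # (b, T, nclass)

# usage sketch:
# model = nn.DataParallel(BatchFirstCRNN(crnn), device_ids=range(opt.ngpu))
# preds = model(images).permute(1, 0, 2)  # back to (T, b, nclass) for CTC
```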

munziliashali commented 6 years ago

@XuLiangFRDC Hello, I tried your trick, but I got this error message: `ConnectionRefusedError: [Errno 111] Connection refused`

How did this happen?