Open XuLiangFRDC opened 7 years ago
@Lzc6996 gives this solution: "I have solved it. Changing output = utils.data_parallel(self.rnn, conv, self.ngpu) to output = self.rnn(conv) makes it work."
My comments: I modified crnn.py as above, replacing output = self.rnn(conv) with output = utils.data_parallel(self.rnn, conv, self.ngpu).
But it doesn't work. @meijieru Would you please take some time to look into this problem? Thanks.
@XuLiangFRDC Our versions of the code are different. My code uses data_parallel for multi-GPU, but the current version uses DataParallel. I tried using DataParallel to solve my problem, but I got the same error as you, so I went back to data_parallel; that solution works for my version of the code. By the way, once you solve this problem you will hit another one, caused by BN.
@Lzc6996 I changed the code back to the old version and modified crnn.py in crnn.pytorch-master/models as you suggested, replacing output = utils.data_parallel(self.rnn, conv, self.ngpu) with output = self.rnn(conv).
Now multi-GPU training works. So far, so good. Thanks a lot.
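For readers following along: the helper discussed above behaves roughly like this sketch (this is an approximation, not the repo's exact utils.data_parallel; on CPU or a single GPU it reduces to the plain self.rnn(conv) call, which is why the two replacements above are interchangeable in that setting):

```python
import torch
import torch.nn as nn

def data_parallel(module, inp, ngpu):
    # Split the batch across ngpu GPUs when the input lives on a GPU
    # and more than one device is requested; otherwise fall back to a
    # plain forward call (the self.rnn(conv) path discussed above).
    if inp.is_cuda and ngpu > 1:
        return nn.parallel.data_parallel(module, inp, list(range(ngpu)))
    return module(inp)

# Shapes as the CRNN feature extractor emits them: (seq_len, batch, features).
rnn = nn.LSTM(512, 256, bidirectional=True)
conv = torch.randn(26, 4, 512)
output, _ = data_parallel(rnn, conv, ngpu=1)  # plain forward on one device
```
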
@XuLiangFRDC But you won't get the same accuracy as with a single GPU, because with two GPUs each GPU has its own BN: every replica computes batch statistics over only its own shard of the batch. So you may need to double your batchSize, or keep the BN layers on a single GPU. In fact, if your code already works well, you need to do nothing.
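Editor's note: on recent PyTorch versions, an alternative to enlarging the batch is to share BN statistics across GPUs with SyncBatchNorm. This is a minimal sketch, not part of this repo, and the synchronization only takes effect under DistributedDataParallel (not nn.DataParallel):

```python
import torch.nn as nn

# A small CNN stem with BatchNorm, standing in for the CRNN backbone.
model = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1),
                      nn.BatchNorm2d(64),
                      nn.ReLU(inplace=True))

# convert_sync_batchnorm swaps every BatchNorm layer for a SyncBatchNorm
# that averages statistics across all participating GPUs, so each replica
# no longer normalizes with only its own small shard of the batch.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```
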
I'm in the same boat as you guys @XuLiangFRDC @Lzc6996. I'm using the current code. Is there any way to solve this? Or would it be possible to share your code? Thanks a lot.
@XuLiangFRDC Hello, I tried your trick, but I got this error message: ConnectionRefusedError: [Errno 111] Connection refused
How did this happen?
My OS is CentOS Linux release 7.2.1511 (Core), x86_64. My server has four Tesla P40 GPU cards.
The command is:

python crnn_main.py --trainroot /home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort --valroot /home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort --cuda --ngpu 4 --adadelta --keep_ratio --random_sample

And the error message is:

Namespace(adadelta=True, adam=False, alphabet='0123456789abcdefghijklmnopqrstuvwxyz', batchSize=64, beta1=0.5, crnn='', cuda=True, displayInterval=500, experiment=None, imgH=32, imgW=100, keep_ratio=True, lr=0.01, n_test_disp=10, ngpu=4, nh=256, niter=25, random_sample=True, saveInterval=500, trainroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_train_sort', valInterval=500, valroot='/home/xuliang/CRNN_org/crnn/tool/synth90k_val_sort', workers=2)
Random Seed: 7716
CRNN (
  (cnn): Sequential (
    (conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu0): ReLU (inplace)
    (pooling0): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu1): ReLU (inplace)
    (pooling1): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
    (conv2): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
    (relu2): ReLU (inplace)
    (conv3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu3): ReLU (inplace)
    (pooling2): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv4): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (batchnorm4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu4): ReLU (inplace)
    (conv5): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (relu5): ReLU (inplace)
    (pooling3): MaxPool2d (size=(2, 2), stride=(2, 1), dilation=(1, 1))
    (conv6): Conv2d(512, 512, kernel_size=(2, 2), stride=(1, 1))
    (batchnorm6): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
    (relu6): ReLU (inplace)
  )
  (rnn): Sequential (
    (0): BidirectionalLSTM (
      (rnn): LSTM(512, 256, bidirectional=True)
      (embedding): Linear (512 -> 256)
    )
    (1): BidirectionalLSTM (
      (rnn): LSTM(256, 256, bidirectional=True)
      (embedding): Linear (512 -> 37)
    )
  )
)
[0/25][500/112885] Loss: 16.009935
Start val
Traceback (most recent call last):
  File "crnn_main.py", line 207, in <module>
    val(crnn, test_dataset, criterion)
  File "crnn_main.py", line 158, in val
    sim_preds = converter.decode(preds.data, preds_size.data, raw=False)
  File "/home/xuliang/CRNN_pytorch_v2/crnn.pytorch/utils.py", line 51, in decode
    t[index:index + l], torch.IntTensor([l]), raw=raw))
ValueError: result of slicing is an empty tensor

Thanks for your help.
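Editor's note: that ValueError fires when decode is handed a zero-length slice for some item in the batch, which can happen when the predictions and the per-item sizes come from mismatched multi-GPU shards. Guarding the per-sequence decode against an empty slice avoids the crash. A sketch of greedy CTC decoding with that guard, using hypothetical names rather than the repo's actual utils.py code (blank index 0 and 1-indexed characters, matching the alphabet in the Namespace above):

```python
import torch

def ctc_greedy_decode(t, length, alphabet):
    # Greedy CTC decode of one sequence: collapse repeated labels,
    # then drop the blank (index 0). Map label k to alphabet[k - 1].
    if length == 0:
        # The guard the traceback above suggests adding: an empty
        # slice decodes to the empty string instead of raising.
        return ''
    chars = []
    prev = 0
    for i in range(length):
        k = int(t[i])
        if k != 0 and k != prev:
            chars.append(alphabet[k - 1])
        prev = k
    return ''.join(chars)

alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'
decoded = ctc_greedy_decode(torch.IntTensor([0, 2, 2, 0, 3]), 5, alphabet)
```
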