ruotianluo / self-critical.pytorch

Unofficial PyTorch implementation of Self-critical Sequence Training for Image Captioning, among other captioning models.
MIT License

Got error for transformer with RL on multi-gpu #94

Open xinfu607 opened 5 years ago

xinfu607 commented 5 years ago

I am sorry to disturb you. I encounter the following error with RL when I run the program on three GPUs. The batch size is 18. The Python and PyTorch versions are 2.7.13 and 0.4.1, respectively.

DataLoader loading json file: /data2/fuxin/bottomup_topdown/data/dataset_coco_talk.json
vocab size is 9487
DataLoader loading h5 file: /data2/fuxin/bottomup_topdown/data/cocobu_fc /data2/fuxin/bottomup_topdown/data/cocobu_att /data2/fuxin/bottomup_topdown/data/cocobu_box /data2/fuxin/bottomup_topdown/data/dataset_coco_talk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/_functions.py:58: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
iter 121880 (epoch 85), avg_reward = -0.372, time/batch = 5.653
Traceback (most recent call last):
  File "train.py", line 267, in <module>
    train(opt)
  File "train.py", line 161, in train
    model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag)
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
ValueError: operands could not be broadcast together with shapes (30,) (43,)
Terminating BlobFetcher
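For what it's worth, the final ValueError is a NumPy broadcast failure: somewhere in the self-critical reward computation (presumably in the reward/advantage arithmetic on one DataParallel replica; which arrays exactly is an assumption), two 1-D arrays of different lengths are being subtracted or compared. A minimal sketch that reproduces only the error message, with the shapes taken from the traceback above:

```python
import numpy as np

# Sketch only: the variable names are hypothetical; the shapes (30,) and (43,)
# are copied from the traceback. With batch size 18 on 3 GPUs and 5 sampled
# sequences per image, 30 = 6 images * 5 samples would be expected per replica,
# so the 43-element array is the suspicious one.
sampled_rewards = np.zeros(30)   # e.g. rewards for sampled captions on one replica
baseline_rewards = np.zeros(43)  # e.g. rewards for the greedy baseline, wrong length

try:
    advantage = sampled_rewards - baseline_rewards
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes (30,) (43,)
```

That points to one replica receiving a number of ground-truth caption groups that does not match its share of the batch.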

ruotianluo commented 5 years ago

Is this only happening when using multiple GPUs?

xinfu607 commented 5 years ago

Yes. I don't get the error when using one GPU.

ruotianluo commented 5 years ago

Can you try pth 1.1? See if the error information is more useful.

xinfu607 commented 5 years ago

Thanks. I will try it. The error doesn't happen without RL on pth 0.4.1 when using multiple GPUs. Could you please tell me the reason?

xinfu607 commented 5 years ago

In addition, the error doesn't happen for the topdown model with RL on pth 0.4.1 when using multiple GPUs.

ruotianluo commented 5 years ago

I don't know. I've trained a transformer on 4 GPUs with this code and it was fine.

ruotianluo commented 5 years ago

Have you made any progress?

xinfu607 commented 5 years ago

Not yet. Recently there have been many jobs running on our GPU server. When they are finished, I will switch to pth 1.1.

mujtabaasif commented 5 years ago

I am also having the same issue with PyTorch 1.1. The problem is on this line: https://github.com/ruotianluo/self-critical.pytorch/blob/master/misc/loss_wrapper.py#L24
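As I understand it, that line re-selects the ground-truth caption groups for the current DataParallel replica: tensors such as `gt_indices` are scattered along dim 0, but the plain Python list of ground truths is not, so each replica has to pick out its own entries. A rough, self-contained sketch of that idea with a sanity check (the function name, argument layout, and assertion are assumptions, not the verbatim repository code):

```python
import torch

def select_gts_for_replica(gts, gt_indices, per_replica_batch):
    """Re-select ground-truth caption groups for one DataParallel replica.

    `gt_indices` is a tensor and is split along dim 0 by DataParallel, while
    `gts` is an ordinary list and is not, so each replica indexes into the
    full list. If the selected count does not match the replica's batch,
    the reward arrays computed from it later can end up with mismatched
    lengths, which would produce a broadcast error like the one above.
    """
    replica_gts = [gts[i] for i in gt_indices.tolist()]
    assert len(replica_gts) == per_replica_batch, (
        'replica received %d gt groups for %d images'
        % (len(replica_gts), per_replica_batch))
    return replica_gts

# Toy usage: a full batch of 18 gt groups, replica handling images 6..11.
gts = [['ref a', 'ref b'] for _ in range(18)]
gt_indices = torch.arange(6, 12)
print(len(select_gts_for_replica(gts, gt_indices, per_replica_batch=6)))  # 6
```

Adding a similar assertion inside the wrapper's forward could confirm whether one replica is getting the wrong number of ground-truth groups.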

ruotianluo commented 4 years ago

@Mujtaba-Asif can you print the error for me?