Open xinfu607 opened 5 years ago
Is this only happening when using multiple GPUs?
Yes. The error doesn't occur when using a single GPU.
Can you try PyTorch 1.1? The error message may be more informative there.
Thanks, I will try it. The error doesn't happen without RL on PyTorch 0.4.1 with multiple GPUs. Could you please tell me why?
In addition, the error doesn't happen for the topdown model with RL on PyTorch 0.4.1 with multiple GPUs.
I don't know. I've trained a transformer on 4 GPUs with this code and it was fine.
Have you made any progress?
Not yet. There are currently many jobs running on our GPU server; once they finish, I will upgrade PyTorch to 1.1.
I am also having the same issue with PyTorch 1.1. The problem is on this line: https://github.com/ruotianluo/self-critical.pytorch/blob/master/misc/loss_wrapper.py#L24
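For anyone else hitting this, here is a minimal sketch of the pattern that line relies on, assuming the wrapper receives the full ground-truth list plus a per-sample index tensor (as in the train.py call shown in the traceback below). The names in it (LossWrapperSketch, fc_feats, gts, gt_indices) are illustrative rather than the repository's exact code; the key point is that nn.DataParallel scatters tensor arguments along dim 0 but copies a plain Python list whole to every replica, so each replica has to re-select its own slice of the references.

```python
import torch
import torch.nn as nn

class LossWrapperSketch(nn.Module):
    """Illustrative only. nn.DataParallel scatters tensor arguments along
    dim 0, but a plain Python list (here `gts`) is copied whole to every
    replica, so each replica must pick out its own slice by index."""

    def forward(self, fc_feats, gts, gt_indices):
        # `gt_indices` is a tensor, so it arrives already scattered;
        # use it to re-align the replicated list with this replica's batch.
        gts = [gts[i] for i in gt_indices.tolist()]
        assert len(gts) == fc_feats.size(0)
        # ... the self-critical reward would be computed against `gts` here ...
        return fc_feats.new_zeros(1)  # placeholder loss

# Hypothetical usage, mirroring the call shown in the traceback below:
# dp_lw_model = nn.DataParallel(LossWrapperSketch())
# out = dp_lw_model(fc_feats, data['gts'], torch.arange(0, len(data['gts'])))
```

If the selected references and the replica's batch ever disagree in length, the per-sample reward arrays can end up with different sizes, which would surface as a broadcast error like the one reported below.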
@Mujtaba-Asif can you print the error for me?
I am sorry to disturb you. I encounter the following error with RL when I run the program on three GPUs. The batch size is 18. The Python version is 2.7.13 and the PyTorch version is 0.4.1.
DataLoader loading json file: /data2/fuxin/bottomup_topdown/data/dataset_coco_talk.json
vocab size is 9487
DataLoader loading h5 file: /data2/fuxin/bottomup_topdown/data/cocobu_fc /data2/fuxin/bottomup_topdown/data/cocobu_att /data2/fuxin/bottomup_topdown/data/cocobu_box /data2/fuxin/bottomup_topdown/data/dataset_coco_talk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/_functions.py:58: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
iter 121880 (epoch 85), avg_reward = -0.372, time/batch = 5.653
Traceback (most recent call last):
  File "train.py", line 267, in <module>
    train(opt)
  File "train.py", line 161, in train
    model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag)
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/fuxin/soft/anaconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
ValueError: operands could not be broadcast together with shapes (30,) (43,)
Terminating BlobFetcher
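For context on the final line of the traceback: this is NumPy's elementwise broadcast failure, raised when two 1-D arrays of different lengths are combined. A tiny standalone reproduction (the array names and contents are made up; only the shapes match the report above):

```python
import numpy as np

# Two 1-D arrays of different lengths cannot be combined elementwise.
rewards = np.zeros(30)    # hypothetical: one score per generated caption
baselines = np.zeros(43)  # hypothetical: a mismatched number of scores
rewards - baselines       # ValueError: operands could not be broadcast
                          # together with shapes (30,) (43,)
```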