ruotianluo / self-critical.pytorch

Unofficial PyTorch implementation of Self-critical Sequence Training for Image Captioning, and other methods.
MIT License

Error: transformer with SC on multi GPU #189

Open xuewyang opened 4 years ago

xuewyang commented 4 years ago

I ran the code twice and got the same error both times. Please check the following:

This is the first time:

```
iter 44400 (epoch 30), avg_reward = -0.346, time/batch = 1.242
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    train(opt)
  File "train.py", line 176, in train
    model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag, struc_flag)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/loss_wrapper.py", line 60, in forward
    reward = get_self_critical_reward(greedy_res, gts, gen_result, self.opt)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/rewards.py", line 76, in get_self_critical_reward
    scores = scores[:gen_result_size].reshape(batch_size, seq_per_img) - scores[-batch_size:][:, np.newaxis]
ValueError: cannot reshape array of size 3 into shape (16,5)
```

This is the second time:

```
iter 47280 (epoch 31), avg_reward = -0.167, time/batch = 1.206
/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    train(opt)
  File "train.py", line 176, in train
    model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag, struc_flag)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/loss_wrapper.py", line 60, in forward
    reward = get_self_critical_reward(greedy_res, gts, gen_result, self.opt)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/rewards.py", line 76, in get_self_critical_reward
    scores = scores[:gen_result_size].reshape(batch_size, seq_per_img) - scores[-batch_size:][:, np.newaxis]
ValueError: cannot reshape array of size 0 into shape (15,5)
```

Do you know what the reason might be? I am running it a third time now and have put in some breakpoints to try to debug it. When I ran up-down with SC, no error occurred.
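For context, the failing line in misc/rewards.py assumes the scorer returned one score per sampled caption plus one per greedy (baseline) caption. A minimal sketch of that shape arithmetic, with hypothetical numbers chosen to match the log above:

```python
import numpy as np

# Hypothetical sizes mirroring the failing line in misc/rewards.py (line 76 above).
batch_size = 16                              # images per replica
seq_per_img = 5                              # sampled captions per image
gen_result_size = batch_size * seq_per_img   # 80 sampled captions

# The reward code expects one score per sampled caption plus one per greedy
# (baseline) caption, i.e. gen_result_size + batch_size scores in total.
scores = np.random.rand(gen_result_size + batch_size)

# sampled-caption scores minus each image's greedy baseline
rewards = scores[:gen_result_size].reshape(batch_size, seq_per_img) \
          - scores[-batch_size:][:, np.newaxis]
print(rewards.shape)                         # (16, 5)

# In the failing runs, `scores` arrives with only 3 (or 0) elements, so the
# reshape to (16, 5) / (15, 5) cannot succeed.
```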

xuewyang commented 4 years ago

I am using PyTorch 1.4.0 with Python 3.7.

xuewyang commented 4 years ago

I got the same error for the third time. The error is in get_self_critical_reward(). When I add `print(cider_scores.shape, len(gts), len(res_), cider_scores, batch_size, seq_per_img, gen_result_size)`, I get:

```
(2,) 90 90 [0.22051903 0.22582808] 15 5 75
```

It is weird that the inputs gts and res_ are of size 90 but the output cider_scores is of shape (2,).
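One way to narrow this down (a hypothetical diagnostic, not code from the repo) is to assert the expected invariant just before the reshape:

```python
# Hypothetical check inside get_self_critical_reward(), right before the reshape:
# the scorer should return one score per candidate caption it was given.
expected = gen_result_size + batch_size      # 75 + 15 = 90 for the print above
assert cider_scores.shape[0] == expected, (
    f"scorer returned {cider_scores.shape[0]} scores, expected {expected} "
    f"(len(gts)={len(gts)}, len(res_)={len(res_)})")
```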

ruotianluo commented 4 years ago

That's very weird. I cloned master and ran self-critical training, and I didn't see that behavior.

xuewyang commented 4 years ago

Yes, it is very weird. I run SC after 30 epochs; sometimes the error occurred at epoch 31, sometimes at epoch 32. I cloned master and am running it again. I have a question. The following are the files I used, downloaded from this link: https://drive.google.com/drive/folders/1eCdz62FAVCGogOuNhy87Nmlo5_I0sH2J

input_json: cocotalk.json
input_fc_dir: cocobu_fc
input_att_dir: cocobu_att
input_label_h5: cocotalk_label.h5

I didn't actually preprocess the data; I just used the data you provided. I notice there is no cocobu_label.h5 or cocobu.json. Is this normal?

ruotianluo commented 4 years ago

That's fine. "bu" is short for bottom-up; those files are just the image features.

xuewyang commented 4 years ago

I re-cloned master and the same error occurred. I am using two 11GB GPUs. I found that if I set the batch size above 40, about 90% of GPU memory is used and the error occurs, but if I reduce the batch size to about 24, the error disappears. It is pretty weird. I am running the whole process now and will see whether the error happens again.
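One possible factor (an assumption, not something confirmed in the repo): nn.DataParallel scatters inputs along dim 0 across the replicas, so the batch each replica actually sees can differ from the configured batch size, especially on a short or unevenly divisible batch. A rough illustration using torch.chunk, which splits a tensor in a similar way:

```python
import torch

# With 2 GPUs, DataParallel splits each batch along dim 0 across the replicas.
full_batch = torch.zeros(50, 10)    # a full batch of 50 samples
short_batch = torch.zeros(33, 10)   # a hypothetical short (e.g. last) batch

print([c.shape[0] for c in full_batch.chunk(2)])    # [25, 25]
print([c.shape[0] for c in short_batch.chunk(2)])   # [17, 16]

# If the reward computation still reshapes to (batch_size, seq_per_img) using
# the configured batch size, a short or uneven per-replica split could produce
# exactly this kind of "cannot reshape" error.
```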

ruotianluo commented 4 years ago

Keep me posted. Thank you.

xuewyang commented 4 years ago

Reducing the batch size solved this problem. Thank you.

linzhlalala commented 3 years ago

I hit this problem again. I am training with 8 GPUs.

Batch size 10: occurs at val 11530/11539
Batch size 8: occurs at val 11536/11539

Maybe the last batch doesn't fill the full batch size? Setting drop_last=True on the DataLoader may fix it; see the sketch below.
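A minimal sketch of that workaround with a plain torch DataLoader (the repo has its own loader, and the dataset here is a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset with 11539 items, matching the val counts above.
dataset = TensorDataset(torch.zeros(11539, 8))

loader = DataLoader(dataset,
                    batch_size=10,
                    shuffle=False,
                    drop_last=True)   # drop the final partial batch so every
                                      # batch splits evenly across the GPUs

print(len(loader))   # 1153 full batches; the 9 leftover samples are skipped
```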

Update: drop_last=True works, but now I get stuck in the BLEU scorer with an assert(len(hypo) == 1) failure.