xuewyang opened this issue 4 years ago
I am using pytorch 1.4.0 with python 3.7
I am getting the same error for the third time. The error is in get_self_critical_reward(). When I print cider_scores.shape, len(gts), len(res_), cider_scores, batch_size, seq_per_img, gen_result_size, I get: (2,) 90 90 [0.22051903 0.22582808] 15 5 75. It is weird that the inputs gts and res_ both have length 90, but the output cider_scores has shape (2,).
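For context, here is a rough sketch of the shape bookkeeping that get_self_critical_reward relies on (simplified; the random scores below are just a stand-in for the CIDEr output, which is where the mismatch shows up):

```python
import numpy as np

# Simplified sketch of the shape invariant in get_self_critical_reward:
# gen_result_size sampled captions plus batch_size greedy captions are scored
# together, so the scorer should return one score per candidate.
batch_size, seq_per_img = 15, 5
gen_result_size = batch_size * seq_per_img       # 75 sampled captions
num_candidates = gen_result_size + batch_size    # 90 candidates in total

scores = np.random.rand(num_candidates)          # stand-in for the CIDEr scores
rewards = (scores[:gen_result_size].reshape(batch_size, seq_per_img)
           - scores[-batch_size:][:, np.newaxis])
print(rewards.shape)                             # (15, 5)
```

The failure reported here is that scores comes back with length 2 (and later 3 or 0) instead of 90.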
That's very weird. I cloned master and ran self-critical training, and I didn't see that behavior.
Yes, it is very weird. I run SC after 30 epochs; sometimes the error occurs on epoch 31, sometimes on epoch 32. I re-cloned master and am running it again. I also have a question. The following are the files I used, downloaded from this link: https://drive.google.com/drive/folders/1eCdz62FAVCGogOuNhy87Nmlo5_I0sH2J
input_json: cocotalk.json
input_fc_dir: cocobu_fc
input_att_dir: cocobu_att
input_label_h5: cocotalk_label.h5
I didn't actually preprocess the data; I just used the data you provided. I notice there is no cocobu_label.h5 or cocobu.json. Is this normal?
That's fine. "bu" is short for bottom-up; it's just the image features.
I re-cloned master and the same error occurred. I am using two 11 GB GPUs. I found that if I set the batch size above 40, about 90% of the GPU memory is used and the error occurs; if I reduce the batch size to about 24, the error disappears. It is pretty weird. I am running the whole process now and will see whether the error happens again.
Keep me posted. Thank you.
Reducing the batch size solved this problem. Thank you.
I hit this problem again. I am training with 8 GPUs.
Batch size 10: the error occurs at val 11530/11539. Batch size 8: it occurs at val 11536/11539.
Maybe the last batch doesn't fill the full batch size? Setting drop_last=True on the DataLoader may fix it.
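(For reference, a minimal sketch of that change, assuming a standard torch.utils.data.DataLoader is being built; the loader wrapper in this repo may expose the option differently:)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset of 11539 items to mirror the val split size mentioned above.
dataset = TensorDataset(torch.zeros(11539, 1))

# drop_last=True discards the final, smaller batch so every batch the reward
# code sees has exactly batch_size samples.
loader = DataLoader(dataset, batch_size=10, shuffle=False, drop_last=True)
print(len(loader))  # 1153 full batches; the trailing 9 samples are dropped
```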
Update: drop_last=True works, but then I get stuck in the BLEU scorer, where assert(len(hypo) == 1) fails.
I ran the code twice and got the same error. Please check the following:
This is the first time:
iter 44400 (epoch 30), avg_reward = -0.346, time/batch = 1.242
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    train(opt)
  File "train.py", line 176, in train
    model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag, struc_flag)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/loss_wrapper.py", line 60, in forward
    reward = get_self_critical_reward(greedy_res, gts, gen_result, self.opt)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/rewards.py", line 76, in get_self_critical_reward
    scores = scores[:gen_result_size].reshape(batch_size, seq_per_img) - scores[-batch_size:][:, np.newaxis]
ValueError: cannot reshape array of size 3 into shape (16,5)
This is the second time:
iter 47280 (epoch 31), avg_reward = -0.167, time/batch = 1.206
/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "train.py", line 285, in <module>
    train(opt)
  File "train.py", line 176, in train
    model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag, struc_flag)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/xuewyang/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/loss_wrapper.py", line 60, in forward
    reward = get_self_critical_reward(greedy_res, gts, gen_result, self.opt)
  File "/home/xuewyang/Xuewen/Research/graph-captioning/misc/rewards.py", line 76, in get_self_critical_reward
    scores = scores[:gen_result_size].reshape(batch_size, seq_per_img) - scores[-batch_size:][:, np.newaxis]
ValueError: cannot reshape array of size 0 into shape (15,5)
Do you know what the reasons could be? I am running it a third time now and have set some breakpoints to try to debug it. When I ran Up-Down with SC, no error occurred.
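In case it helps the debugging, here is a hypothetical guard one could place just before the reshape in misc/rewards.py so the run fails with a clearer message when the scorer returns the wrong number of scores (the helper and its name are my own, not part of the repo):

```python
import numpy as np

def check_score_shape(scores, batch_size, seq_per_img):
    """Hypothetical helper: verify the scorer returned one score per candidate
    (batch_size * seq_per_img sampled captions plus batch_size greedy ones)."""
    gen_result_size = batch_size * seq_per_img
    expected = gen_result_size + batch_size
    assert scores.shape[0] == expected, (
        'scorer returned %d scores for %d candidates (%d sampled + %d greedy)'
        % (scores.shape[0], expected, gen_result_size, batch_size))

# Example with the sizes from the failing run: 90 candidates expected.
check_score_shape(np.random.rand(90), batch_size=15, seq_per_img=5)
```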