Hi @yzhq97,
You might want to make sure your cfg.FPN.FPN_ON is turned on in your yaml file. If it is on, check "lib/modeling/model_builder.py" and "lib/modeling/rpn_heads.py" to make sure the code calls FPN.fpn_rpn_outputs() to construct the RPN module. A quick way to check this is to print the keys of "rpn_ret" after line 169 ("rpn_ret = self.RPN(blob_conv, im_info, roidb)") in "lib/modeling/model_builder.py".
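A minimal sketch of that check (assuming the quoted line numbers still match your local copy):

# In lib/modeling/model_builder.py, right after the RPN call mentioned above:
rpn_ret = self.RPN(blob_conv, im_info, roidb)
# With FPN on, expect per-level keys such as 'rpn_cls_logits_fpn2', 'rpn_rois_fpn3', etc.
print('rpn_ret keys:', sorted(rpn_ret.keys()))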
I double-checked the code on my local machine and did not get this error. Please let me know when you find the solution so I can tell whether there is something wrong with this version of the repo. Many thanks!
I confirm that cfg.FPN.FPN_ON is True. I printed out rpn_ret.keys(); does this look right?
INFO model_builder.py: 170: rpn_ret keys: ['rpn_cls_logits_fpn2', 'rpn_bbox_pred_fpn2', 'rpn_rois_fpn2', 'rpn_rois_prob_fpn2', 'rpn_cls_logits_fpn3', 'rpn_bbox_pred_fpn3', 'rpn_rois_fpn3', 'rpn_rois_prob_fpn3', 'rpn_cls_logits_fpn4', 'rpn_bbox_pred_fpn4', 'rpn_rois_fpn4', 'rpn_rois_prob_fpn4', 'rpn_cls_logits_fpn5', 'rpn_bbox_pred_fpn5', 'rpn_rois_fpn5', 'rpn_rois_prob_fpn5', 'rpn_cls_logits_fpn6', 'rpn_bbox_pred_fpn6', 'rpn_rois_fpn6', 'rpn_rois_prob_fpn6', 'rois', 'rois_fpn2', 'rois_fpn3', 'rois_fpn4', 'rois_fpn5', 'rois_idx_restore_int32']
Another small issue: you forgot to mention in the README that GoogleNews-vectors-negative300.bin needs to be downloaded and put under data/word2vec_model.
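For anyone else hitting the same gap, once the file is in place it can be sanity-checked with gensim (just a sketch; the repo's own loading code may differ):

from gensim.models import KeyedVectors

# Load the pretrained 300-d word2vec embeddings from the expected location.
w2v = KeyedVectors.load_word2vec_format(
    'data/word2vec_model/GoogleNews-vectors-negative300.bin', binary=True)
print(w2v['person'].shape)  # should be (300,)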
I encountered another error that probably relates to this. The error occurs when I run train_net_step_rel.py:
Traceback (most recent call last):
File "tools/train_net_step_rel.py", line 473, in <module>
main()
File "tools/train_net_step_rel.py", line 443, in main
net_outputs = maskRCNN(**input_data)
File "/mnt/lustre/yangzhuoqian/anaconda3/envs/py36torch04/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/data_parallel.py", line 111, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/data_parallel.py", line 139, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/parallel_apply.py", line 42, in _worker
output = module(*input, **kwargs)
File "/mnt/lustre/yangzhuoqian/anaconda3/envs/py36torch04/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/modeling/model_builder_rel.py", line 242, in forward
return self._forward(data, im_info, dataset_name, roidb, use_gt_labels, **rpn_kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/modeling/model_builder_rel.py", line 432, in _forward
return_dict['losses'][k] = v.unsqueeze(0)
AttributeError: 'list' object has no attribute 'unsqueeze'
I printed out all the keys and values in the return_dict and found that losses['loss_rpn_cls'] and losses['loss_rpn_bbox'] are lists of tensors and contain nan values:
INFO model_builder_rel.py: 427: losses: loss_rpn_cls -> [tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0')]
INFO model_builder_rel.py: 427: losses: loss_rpn_bbox -> [tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0'), tensor(nan., device='cuda:0')]
These two losses are returned by the FPN.fpn_rpn_losses function in the rpn_heads module:
def fpn_rpn_losses(**kwargs):
    """Add RPN on FPN specific losses."""
    losses_cls = []
    losses_bbox = []
    for lvl in range(cfg.FPN.RPN_MIN_LEVEL, cfg.FPN.RPN_MAX_LEVEL + 1):
        ...
        losses_cls.append(loss_rpn_cls_fpn)
        losses_bbox.append(loss_rpn_bbox_fpn)
    return losses_cls, losses_bbox
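Since fpn_rpn_losses returns one loss per pyramid level, I suspect the code that later does v.unsqueeze(0) needs an FPN-aware branch. For comparison, this is roughly how Detectron.pytorch-style model_builder.py files register per-level losses (a sketch based on that convention; I'm not sure this is what model_builder_rel.py intends):

# Sketch of an FPN-aware branch for registering RPN losses, keyed per pyramid level.
if cfg.FPN.FPN_ON:
    for i, lvl in enumerate(range(cfg.FPN.RPN_MIN_LEVEL, cfg.FPN.RPN_MAX_LEVEL + 1)):
        return_dict['losses']['loss_rpn_cls_fpn%d' % lvl] = loss_rpn_cls[i]
        return_dict['losses']['loss_rpn_bbox_fpn%d' % lvl] = loss_rpn_bbox[i]
else:
    return_dict['losses']['loss_rpn_cls'] = loss_rpn_cls
    return_dict['losses']['loss_rpn_bbox'] = loss_rpn_bbox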
Hi @yzhq97,
Thank you for notifying me about the missing word2vec model link.
About the nan loss, I've never encountered that before, so I don't really know the answer. It would be better if you first make sure you are able to train an object detector on VG200; then you can simply swap the dataset to GQA and see if this error still occurs.
I am able to train the object detector on both VG and GQA. But when I attempted to train the relation detection model on VG, the same error occurred:
Traceback (most recent call last):
File "tools/train_net_step_rel.py", line 473, in <module>
main()
File "tools/train_net_step_rel.py", line 443, in main
net_outputs = maskRCNN(**input_data)
File "/mnt/lustre/yangzhuoqian/anaconda3/envs/py36torch04/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/data_parallel.py", line 111, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/data_parallel.py", line 139, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/parallel_apply.py", line 67, in parallel_apply
raise output
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/nn/parallel/parallel_apply.py", line 42, in _worker
output = module(*input, **kwargs)
File "/mnt/lustre/yangzhuoqian/anaconda3/envs/py36torch04/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/modeling/model_builder_rel.py", line 242, in forward
return self._forward(data, im_info, dataset_name, roidb, use_gt_labels, **rpn_kwargs)
File "/mnt/lustre/yangzhuoqian/codespace/lsvrd-origin/lib/modeling/model_builder_rel.py", line 446, in _forward
return_dict['losses'][k] = v.unsqueeze(0)
AttributeError: 'list' object has no attribute 'unsqueeze'
Again, losses['loss_rpn_cls'] and losses['loss_rpn_bbox'] are lists instead of torch.Tensors, only this time they do not contain nan values.
I am using pytorch==0.4.1 with CUDA 9.0 on eight 1080Ti GPUs. I will keep searching for a solution.
I modified lines 391-392 in lib/modeling/model_builder_rel.py
return_dict['losses']['loss_rpn_cls'] = loss_rpn_cls
return_dict['losses']['loss_rpn_bbox'] = loss_rpn_bbox
to
if isinstance(loss_rpn_cls, list):
    for i in range(len(loss_rpn_cls)):
        return_dict['losses']['loss_rpn_cls_%d' % i] = loss_rpn_cls[i]
else:
    return_dict['losses']['loss_rpn_cls'] = loss_rpn_cls
if isinstance(loss_rpn_bbox, list):
    for i in range(len(loss_rpn_bbox)):
        return_dict['losses']['loss_rpn_bbox_%d' % i] = loss_rpn_bbox[i]
else:
    return_dict['losses']['loss_rpn_bbox'] = loss_rpn_bbox
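An alternative I have not tested against this repo would be to collapse the per-level lists into single scalar tensors before they reach the unsqueeze call, which keeps the original loss names:

# Hypothetical alternative: sum the per-FPN-level losses into one scalar tensor each,
# so the downstream v.unsqueeze(0) loop in _forward works unchanged.
if isinstance(loss_rpn_cls, list):
    loss_rpn_cls = sum(loss_rpn_cls)
if isinstance(loss_rpn_bbox, list):
    loss_rpn_bbox = sum(loss_rpn_bbox)
return_dict['losses']['loss_rpn_cls'] = loss_rpn_cls
return_dict['losses']['loss_rpn_bbox'] = loss_rpn_bbox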
Now I am able to run test_net.py and train_net_step_rel.py properly. (I am pretty sure there is something wrong with my GQA data now.)
Could you please provide your code for generating the Visual Genome annotations? It would be really helpful for spotting problems in my data.
I can run train_net_step_rel.py on the GQA dataset now. I noticed that the filter_for_training function in roidb_rel.py filters out certain entries, and whenever any entries are filtered out, the above error occurs. I solved it by printing out the image_ids of the filtered entries and manually removing them from the data I generated.
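In case it helps anyone else, this is roughly the logging I added inside filter_for_training (a sketch; the is_valid() predicate and the 'image' key are assumptions about the roidb entry layout, so adjust to the actual field names):

# Rough sketch: report which roidb entries get dropped by filter_for_training.
filtered_roidb = [entry for entry in roidb if is_valid(entry)]
removed = [entry.get('image') for entry in roidb if not is_valid(entry)]
logger.info('Filtered %d roidb entries, removed images: %s',
            len(roidb) - len(filtered_roidb), removed)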
Hi, I recently trained an object detector on GQA and tried to test it, but encountered the following error:
Could you please provide some insights on how this may be solved? Appreciate it!