longcw / faster_rcnn_pytorch

Faster RCNN with PyTorch
MIT License

Train new dataset: zeros after conv3 in vgg16 #20

Open kduy opened 7 years ago

kduy commented 7 years ago

I am trying to train the model with my own dataset. Sometimes, I get this error:

  File "train.py", line 127, in <module>
    net(im_data, im_info, gt_boxes, gt_ishard, dontcare_areas)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/faster_rcnn.py", line 219, in forward
    roi_data = self.proposal_target_layer(rois, gt_boxes, gt_ishard, dontcare_areas, self.n_classes)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/faster_rcnn.py", line 287, in proposal_target_layer
    proposal_target_layer_py(rpn_rois, gt_boxes, gt_ishard, dontcare_areas, num_classes)
  File "/data/code/faster_rcnn_pytorch/faster_rcnn/rpn_msr/proposal_target_layer.py", line 66, in proposal_target_layer
    np.hstack((zeros, np.vstack((gt_easyboxes[:, :-1], jittered_gt_boxes[:, :-1]))))))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

I traced the bug and figured out that the network returns an all-zero array after conv3 in faster_rcnn/vgg16.py, and hence an all-zero feature map after forwarding through vgg16. Do you have any clue why? Thank you.
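(A minimal debugging sketch, not part of the repo: register a forward hook on every Conv2d in the VGG16 feature extractor to see where the activations first collapse to all zeros or go NaN. The name net is assumed to be the FasterRCNN instance from train.py.)

    import torch.nn as nn

    def check_output(module, inputs, output):
        data = output.data
        if data.abs().sum() == 0:
            print('all-zero activations after', module)
        elif (data != data).sum() > 0:  # NaN != NaN, so a nonzero count means NaNs
            print('NaN activations after', module)

    for m in net.modules():
        if isinstance(m, nn.Conv2d):
            m.register_forward_hook(check_output)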

abhiML commented 7 years ago

Same problem. Any solution? Any help would be appreciated. @acgtyrant @kduy

abhiML commented 7 years ago

@longcw

acgtyrant commented 7 years ago

@abhiML I am refactoring the program, and it is still ongoing, so I have not got it working so far.

acgtyrant commented 7 years ago

Do you load the pretrained npy weights for vgg16?
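(If not: the README has you download the VGG16 ImageNet weights, and train.py loads them roughly like this, paraphrased from memory, so check the script for the exact call and path:)

    from faster_rcnn import network
    network.load_pretrained_npy(net, 'data/pretrained_model/VGG_imagenet.npy')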

abhiML commented 7 years ago

Yeah, first it gives a runtime warning:

RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
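(That warning usually means ws/hs contain NaN, i.e. the bbox deltas coming out of the RPN already went NaN upstream. A hypothetical guard added to _filter_boxes in rpn_msr/proposal_layer.py, whose body is paraphrased here, would confirm it:)

    import numpy as np

    def _filter_boxes(boxes, min_size):
        ws = boxes[:, 2] - boxes[:, 0] + 1
        hs = boxes[:, 3] - boxes[:, 1] + 1
        # added guard: NaN/Inf boxes are what trip 'invalid value in greater_equal'
        if not np.isfinite(boxes).all():
            raise ValueError('NaN/Inf in proposals -- check learning rate and gt_boxes')
        keep = np.where((ws >= min_size) & (hs >= min_size))[0]
        return keep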
acgtyrant commented 7 years ago

Do you use Python 2? I do not encounter this error.

abhiML commented 7 years ago

Yeah, I am using 2.7. Are you running it on your own dataset?

acgtyrant commented 7 years ago

I ran it for a few steps on the PASCAL VOC 2007 trainval dataset with no problem. If you want to run it on a new dataset, you must adjust the source code yourself.

abhiML commented 7 years ago

Yeah, but what exactly do I have to adjust? I just changed the classes in pascal_voc.py and prepared the dataset in the Pascal VOC 2007 format.
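(For reference, the class tuple in pascal_voc.py looks roughly like this; with a custom dataset you swap in your own names but keep '__background__' at index 0, and the length must match the number of classes the network is constructed with:)

    self._classes = ('__background__',            # always index 0
                     'my_class_1', 'my_class_2')  # hypothetical custom classes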

acgtyrant commented 7 years ago

I have not trained the model on a new dataset yet; please wait.

abhiML commented 7 years ago

Okay

abhiML commented 7 years ago

https://github.com/rbgirshick/py-faster-rcnn/issues/65 Could you take a look at this issue?

abhiML commented 7 years ago

@acgtyrant Going by https://github.com/longcw/faster_rcnn_pytorch/blob/master/faster_rcnn/network.py#L109, as far as I understand, if totalnorm becomes very large, then the scale factor norm gets really small and underflow occurs? Is that correct?

acgtyrant commented 7 years ago

No, it is used to prevent overflow from occurring.
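(Roughly what that function does, paraphrased from memory rather than quoted: it computes the global L2 norm of all gradients and rescales them by clip_norm / max(totalnorm, clip_norm). That factor is always <= 1, so gradients are only ever shrunk; a huge totalnorm giving a tiny factor is exactly the intended clipping, not underflow.)

    import numpy as np

    def clip_gradient(model, clip_norm):
        # global L2 norm over all parameter gradients
        totalnorm = 0
        for p in model.parameters():
            if p.requires_grad and p.grad is not None:
                totalnorm += p.grad.data.norm() ** 2
        totalnorm = np.sqrt(totalnorm)
        # scale factor is 1 when totalnorm <= clip_norm, < 1 otherwise
        scale = clip_norm / max(totalnorm, clip_norm)
        for p in model.parameters():
            if p.requires_grad and p.grad is not None:
                p.grad.data.mul_(scale)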

abhiML commented 7 years ago

But I am using that function and I am still getting the error.

gls81 commented 7 years ago

I had the issue described, and I now seem to be able to train without this error when using SGD (with ADAM the loss goes to NaN). I would suggest you check whether the values in the gt_boxes of any image cause this error. For me, when reading the XML files it was assigning some negative values, which were being transformed into huge numbers. Also, the Pascal VOC loader subtracts 1 from XMIN and YMIN, so if your bounding boxes are at 0 they will be set to -1, and this caused issues as well. I fixed this in my _load_AFLW_annotation function by making sure the absolute value was taken, and by not doing the subtraction if a value was equal to 0. This may help; see the sketch below.
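(A sketch of that guard, illustrative only since _load_AFLW_annotation is gls81's own function: take the absolute value, and only apply the VOC-style -1 when the coordinate is nonzero, so gt_boxes can never go negative.)

    def to_zero_based(value):                 # hypothetical helper
        value = abs(float(value))             # guard against negative annotations
        return value - 1 if value > 0 else 0  # skip the -1 when already at 0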

abhiML commented 7 years ago

Yeah, I was making a similar mistake: in the dataset some of the annotations were wrong (xmin > xmax). Once I corrected those and set the negative values to 0, it worked fine.
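(A small standalone sketch, not part of the repo, that scans VOC-style XML annotations for both problems at once: swapped coordinates and negative values. The directory path is an assumption; point it at your own Annotations folder.)

    import os
    import xml.etree.ElementTree as ET

    ann_dir = 'data/VOCdevkit2007/VOC2007/Annotations'  # hypothetical path
    for fname in os.listdir(ann_dir):
        tree = ET.parse(os.path.join(ann_dir, fname))
        for obj in tree.findall('object'):
            b = obj.find('bndbox')
            x1, y1, x2, y2 = [float(b.find(k).text)
                              for k in ('xmin', 'ymin', 'xmax', 'ymax')]
            if x1 > x2 or y1 > y2 or min(x1, y1, x2, y2) < 0:
                print(fname, x1, y1, x2, y2)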

liyuanyaun commented 7 years ago

I have checked my annotations and they are correct for the experiment, so does anyone know of any other bug that could lead to this problem?

zhyx12 commented 7 years ago

@liyuanyaun I have encountered this problem too. After discarding the shuffle operation in RoIDataLayer() and locating the image where the error occurs, I found that one of the bounding boxes has xmin=0, and pascal_voc.py, which I imitated, has a -1 operation, so gt_boxes got a negative value. Here is a related issue: https://github.com/rbgirshick/py-faster-rcnn/issues/9 (you can search for 'based'). After removing the -1 and deleting the cached ground-truth .pkl file (needed if you created one before), the error is gone.