sanghoon / pva-faster-rcnn

Demo code for PVANet
https://arxiv.org/abs/1611.08588

Training loss is nan for custom dataset #58

Closed vj-1988 closed 7 years ago

vj-1988 commented 7 years ago

@sanghoon : Hi,

I am trying to train my custom dataset with 64 classes using PVANet. I am able to start the training with the following command:

    ./tools/train_net.py --gpu 1 --solver models/pvanet/example_train/solver.prototxt --imdb voc_2007_trainval --iters 70000 --cfg models/pvanet/cfgs/train.yml --weights models/pvanet/example_train/pva9.1_pretrained_no_fc6.caffemodel

The training loss starts around 5.

    I0201 18:00:12.837128 22329 solver.cpp:238] Iteration 0, loss = 5.07012
    I0201 18:00:12.837157 22329 solver.cpp:254]     Train net output #0: loss_bbox = 0.0719297 (* 1 = 0.0719297 loss)
    I0201 18:00:12.837162 22329 solver.cpp:254]     Train net output #1: loss_cls = 4.31714 (* 1 = 4.31714 loss)
    I0201 18:00:12.837165 22329 solver.cpp:254]     Train net output #2: rpn_loss_bbox = 0.000328277 (* 1 = 0.000328277 loss)
    I0201 18:00:12.837169 22329 solver.cpp:254]     Train net output #3: rpn_loss_cls = 0.692441 (* 1 = 0.692441 loss)
    I0201 18:00:12.837173 22329 sgd_solver.cpp:138] Iteration 0, lr = 0.0001
    /home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:24: RuntimeWarning: invalid value encountered in log
      targets_dh = np.log(gt_heights / ex_heights)
    /home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:166: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
      fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)
    /home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:177: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
      bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)
    /home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:184: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
      labels[fg_rois_per_this_image:] = 0
    I0201 18:00:21.564218 22329 solver.cpp:238] Iteration 20, loss = nan
    I0201 18:00:21.564246 22329 solver.cpp:254]     Train net output #0: loss_bbox = 0.0315711 (* 1 = 0.0315711 loss)
    I0201 18:00:21.564251 22329 solver.cpp:254]     Train net output #1: loss_cls = 0.850022 (* 1 = 0.850022 loss)
    I0201 18:00:21.564256 22329 solver.cpp:254]     Train net output #2: rpn_loss_bbox = 0.00812243 (* 1 = 0.00812243 loss)
    I0201 18:00:21.564260 22329 solver.cpp:254]     Train net output #3: rpn_loss_cls = 0.690506 (* 1 = 0.690506 loss)
    I0201 18:00:21.564265 22329 sgd_solver.cpp:138] Iteration 20, lr = 0.0001

Then the loss immediately jumps to nan. I tried reducing the learning rate to as low as 0.000001, but the problem persists. Has anyone else faced this situation?
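
For reference, the RuntimeWarning in the log above comes from lib/fast_rcnn/bbox_transform.py, where the bbox regression target is the log of a ratio of box heights; a degenerate box (e.g. ymax smaller than ymin) makes the ratio negative, the log becomes nan, and the nan then propagates into the loss. A minimal sketch of that mechanism; the heights() helper and the box values below are made up for illustration:

    import numpy as np

    # Heights computed the same way as in bbox_transform.py: ymax - ymin + 1
    def heights(boxes):
        return boxes[:, 3] - boxes[:, 1] + 1.0

    # One valid ground-truth box and one degenerate box (ymax < ymin),
    # e.g. produced by a corrupt annotation.
    gt_boxes = np.array([[10., 10., 50., 60.],
                         [10., 40., 50., 20.]])
    ex_boxes = np.array([[12., 12., 48., 58.],
                         [12., 12., 48., 58.]])

    gt_heights = heights(gt_boxes)   # [51., -19.]
    ex_heights = heights(ex_boxes)   # [47.,  47.]

    # Same expression as targets_dh in bbox_transform.py:
    # the log of a negative ratio is nan, which poisons the loss.
    targets_dh = np.log(gt_heights / ex_heights)
    print(targets_dh)                # [ 0.0816...  nan]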

nynorbert commented 7 years ago

Hi,

I have a similar problem with a custom dataset, but I don't know how to solve it yet. Have you made any progress with this problem?

In my case, sometimes the rpn_loss_bbox is also "nan":

    I0207 21:43:03.959089 12331 solver.cpp:229] Iteration 640, loss = nan
    I0207 21:43:03.959159 12331 solver.cpp:245]     Train net output #0: loss_bbox = 0 (* 1 = 0 loss)
    I0207 21:43:03.959174 12331 solver.cpp:245]     Train net output #1: loss_cls = 0 (* 1 = 0 loss)
    I0207 21:43:03.959187 12331 solver.cpp:245]     Train net output #2: rpn_cls_loss = 0.653614 (* 1 = 0.653614 loss)
    I0207 21:43:03.959197 12331 solver.cpp:245]     Train net output #3: rpn_loss_bbox = nan (* 1 = nan loss)
    I0207 21:43:03.959209 12331 sgd_solver.cpp:106] Iteration 640, lr = 0.0001
    I0207 21:43:23.685860 12331 solver.cpp:229] Iteration 660, loss = nan
    I0207 21:43:23.685931 12331 solver.cpp:245]     Train net output #0: loss_bbox = 0.000342935 (* 1 = 0.000342935 loss)
    I0207 21:43:23.685945 12331 solver.cpp:245]     Train net output #1: loss_cls = 0.111643 (* 1 = 0.111643 loss)
    I0207 21:43:23.685956 12331 solver.cpp:245]     Train net output #2: rpn_cls_loss = 0.809447 (* 1 = 0.809447 loss)
    I0207 21:43:23.685966 12331 solver.cpp:245]     Train net output #3: rpn_loss_bbox = 1.3496 (* 1 = 1.3496 loss)
    I0207 21:43:23.685976 12331 sgd_solver.cpp:106] Iteration 660, lr = 0.0001
    I0207 21:43:43.506032 12331 solver.cpp:229] Iteration 680, loss = nan
    I0207 21:43:43.506108 12331 solver.cpp:245]     Train net output #0: loss_bbox = 0 (* 1 = 0 loss)
    I0207 21:43:43.506120 12331 solver.cpp:245]     Train net output #1: loss_cls = 0 (* 1 = 0 loss)
    I0207 21:43:43.506134 12331 solver.cpp:245]     Train net output #2: rpn_cls_loss = 0.733904 (* 1 = 0.733904 loss)
    I0207 21:43:43.506145 12331 solver.cpp:245]     Train net output #3: rpn_loss_bbox = nan (* 1 = nan loss)
    I0207 21:43:43.506156 12331 sgd_solver.cpp:106] Iteration 680, lr = 0.0001
    I0207 21:44:03.204167 12331 solver.cpp:229] Iteration 700, loss = nan
    I0207 21:44:03.204233 12331 solver.cpp:245]     Train net output #0: loss_bbox = 0 (* 1 = 0 loss)
    I0207 21:44:03.204252 12331 solver.cpp:245]     Train net output #1: loss_cls = 0 (* 1 = 0 loss)
    I0207 21:44:03.204260 12331 solver.cpp:245]     Train net output #2: rpn_cls_loss = 0.795648 (* 1 = 0.795648 loss)
    I0207 21:44:03.204272 12331 solver.cpp:245]     Train net output #3: rpn_loss_bbox = nan (* 1 = nan loss)
    I0207 21:44:03.204288 12331 sgd_solver.cpp:106] Iteration 700, lr = 0.0001
    I0207 21:44:23.145709 12331 solver.cpp:229] Iteration 720, loss = 0.902763
    I0207 21:44:23.145797 12331 solver.cpp:245]     Train net output #0: loss_bbox = 2.89623e-05 (* 1 = 2.89623e-05 loss)
    I0207 21:44:23.145810 12331 solver.cpp:245]     Train net output #1: loss_cls = 0.0984289 (* 1 = 0.0984289 loss)
    I0207 21:44:23.145823 12331 solver.cpp:245]     Train net output #2: rpn_cls_loss = 0.562016 (* 1 = 0.562016 loss)
    I0207 21:44:23.145830 12331 solver.cpp:245]     Train net output #3: rpn_loss_bbox = 0.27673 (* 1 = 0.27673 loss)

I haven't used any pretrained model, and my learning rate is 0.0001.

vj-1988 commented 7 years ago

@nynorbert: I haven't solved the issue yet. But if we use a pretrained model, I guess at least the bbox loss wouldn't be nan.

Can someone confirm whether we have to use a pretrained model that is also trained on our custom dataset classes, like we usually do in py-faster-rcnn?

MyVanitar commented 7 years ago

@vj-1988 @nynorbert

I don't have the answer to any of these questions, and it is very good that you got to this step without that information.

vj-1988 commented 7 years ago

@VanitarNordic: We don't create an LMDB for PVANet training. I followed the steps on this page:

https://github.com/sanghoon/pva-faster-rcnn/tree/master/models

MyVanitar commented 7 years ago

@vj-1988: If we have our own custom image dataset, then what should the model be trained on? I want to use PVANET+ (Compressed).

smg478 commented 7 years ago

I had this issue. I found that one of my annotation .xml files was wrongly annotated (a negative value in the coordinates). I corrected the annotation, and after that training was fine. No more nan values.
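
For reference, a quick way to catch such annotations before training is to scan every XML file for negative or inverted coordinates. A minimal sketch, assuming VOC-style annotations; the Annotations path below is only an example and should be adjusted to your dataset layout:

    import os
    import xml.etree.ElementTree as ET

    # Example path; adjust to wherever your VOC-style annotations live.
    ann_dir = 'data/VOCdevkit2007/VOC2007/Annotations'

    for fname in sorted(os.listdir(ann_dir)):
        if not fname.endswith('.xml'):
            continue
        tree = ET.parse(os.path.join(ann_dir, fname))
        for obj in tree.findall('object'):
            bb = obj.find('bndbox')
            x1 = float(bb.find('xmin').text)
            y1 = float(bb.find('ymin').text)
            x2 = float(bb.find('xmax').text)
            y2 = float(bb.find('ymax').text)
            # Flag negative coordinates and boxes with zero or negative area.
            if x1 < 0 or y1 < 0 or x2 <= x1 or y2 <= y1:
                print('suspicious box in %s: %.1f %.1f %.1f %.1f'
                      % (fname, x1, y1, x2, y2))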

vj-1988 commented 7 years ago

@smg478 : Thank you. I will check my annotations once again and update here on the status of the training.

MyVanitar commented 7 years ago

@vj-1988

Do you have any idea about creating the IMDB for a custom image dataset? It is one of the input training parameters here:

https://github.com/sanghoon/pva-faster-rcnn/tree/master/models

vj-1988 commented 7 years ago

There is an explanation here regarding IMDB creation for custom datasets.

MyVanitar commented 7 years ago

@vj-1988

Thanks, I read that briefly. One question pops up: here ( https://github.com/sanghoon/pva-faster-rcnn/tree/master/models/pvanet/example_train ) we see that the input IMDB has information for both training and validation in one file, as it is named voc_2007_trainval.

Does this create the same file?

vj-1988 commented 7 years ago

For the VOC dataset, train.txt, val.txt, and trainval.txt are already provided along with the dataset.

screenshot_20170227_183154

So we have to generate trainval.txt for our custom dataset and use it. It contains the filenames of the train and val images in the JPEGImages folder, without the file extension.
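
For example, a minimal way to generate trainval.txt from the JPEGImages folder (the paths below are assumptions and should be adjusted to your dataset layout):

    import os

    # Example VOC-style layout; adjust both paths to your dataset.
    jpeg_dir = 'data/VOCdevkit2007/VOC2007/JPEGImages'
    out_file = 'data/VOCdevkit2007/VOC2007/ImageSets/Main/trainval.txt'

    with open(out_file, 'w') as f:
        for fname in sorted(os.listdir(jpeg_dir)):
            if fname.endswith('.jpg'):
                # One image id per line, without the file extension.
                f.write(os.path.splitext(fname)[0] + '\n')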

vj-1988 commented 7 years ago

@smg478 : Thank you. I am training on a new custom dataset which was carefully annotated, and I am not getting nan losses anymore.

charlesYMX commented 6 years ago

The solution for me was to rewrite gt_roidb() and _load_pascal_annotation() (these two functions are in pascal_voc.py).

Why?

Because some of the training data in my XML files was not right, so I was bringing wrong numbers into the training process.

Here is the code I wrote (just the parts I changed, not all the code in that file):

gt_roidb():

    # Drop images whose annotation turns out to be invalid and remove them
    # from the image index as well.
    while index < len(self.image_index):
        annotation = self._load_pascal_annotation(self.image_index[index])
        if annotation is not None:
            gt_roidb.append(annotation)
            index += 1
        else:
            removeCount += 1
            del self.image_index[index]
    print "removed count: {} {}".format(removeCount, "-" * 30)
    print "image number after removal: {} {}".format(len(self.image_index), "-" * 30)
    image_set_file = os.path.join(self._data_path, 'ImageSets', 'Main',
                                  self._image_set + '.txt')
    os.remove(image_set_file)  # delete the old image index txt file
    with open(image_set_file, 'a+') as f:  # rewrite the image index txt file
        for val_item in self.image_index:
            f.write(val_item)
            f.write('\n')

_load_pascal_annotation():

    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # Read the raw box coordinates from the XML annotation
        x1 = float(bbox.find('xmin').text)
        y1 = float(bbox.find('ymin').text)
        x2 = float(bbox.find('xmax').text)
        y2 = float(bbox.find('ymax').text)
        # Swap inverted coordinates
        if x1 > x2:
            x1, x2 = x2, x1
        if y1 > y2:
            y1, y2 = y2, y1
        # Clip the box to the image boundaries
        if x1 < 0: x1 = 0
        if y1 < 0: y1 = 0
        if x2 > img_width: x2 = img_width
        if y2 > img_height: y2 = img_height

        # Skip boxes that are still degenerate after the fixes above
        if x1 >= x2 or y1 >= y2:
            print "%s, %d, %d, %d, %d" % (index, x1, y1, x2, y2)
            continue
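
One caveat with modifying gt_roidb(): the stock py-faster-rcnn pascal_voc.py caches the roidb to a pickle under data/cache, so if such a cache already exists the method returns it and the filtering above never runs. Clearing the cache after changing the annotation handling avoids that; a minimal sketch (the cache path follows the usual py-faster-rcnn layout and may differ in your setup):

    import glob
    import os

    # Default py-faster-rcnn cache location; adjust if your setup differs.
    cache_dir = 'data/cache'

    for pkl in glob.glob(os.path.join(cache_dir, '*_gt_roidb.pkl')):
        print('removing stale roidb cache: %s' % pkl)
        os.remove(pkl)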