sanghoon / pva-faster-rcnn

Demo code for PVANet

Training loss is nan for custom dataset #58

Closed vj-1988 closed 7 years ago

vj-1988 commented 7 years ago

@sanghoon : Hi,

I am trying to train my custom dataset with 64 classes using pvanet. I am able to start the training using the following command

'./tools/ --gpu 1 --solver models/pvanet/example_train/solver.prototxt --imdb voc_2007_trainval --iters 70000 --cfg models/pvanet/cfgs/train.yml --weights models/pvanet/example_train/pva9.1_pretrained_no_fc6.caffemodel'

The training loss starts around 5.

'I0201 18:00:12.837128 22329 solver.cpp:238] Iteration 0, loss = 5.07012
I0201 18:00:12.837157 22329 solver.cpp:254] Train net output #0: loss_bbox = 0.0719297 ( 1 = 0.0719297 loss)
I0201 18:00:12.837162 22329 solver.cpp:254] Train net output #1: loss_cls = 4.31714 ( 1 = 4.31714 loss)
I0201 18:00:12.837165 22329 solver.cpp:254] Train net output #2: rpn_loss_bbox = 0.000328277 ( 1 = 0.000328277 loss)
I0201 18:00:12.837169 22329 solver.cpp:254] Train net output #3: rpn_loss_cls = 0.692441 ( 1 = 0.692441 loss)
I0201 18:00:12.837173 22329 sgd_solver.cpp:138] Iteration 0, lr = 0.0001
/home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/fast_rcnn/ RuntimeWarning: invalid value encountered in log
targets_dh = np.log(gt_heights / ex_heights)
/home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/rpn/ VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)
/home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/rpn/ VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)
/home/drive/Softwares/PVA-frcnn-new/pva-faster-rcnn/tools/../lib/rpn/ VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
labels[fg_rois_per_this_image:] = 0
I0201 18:00:21.564218 22329 solver.cpp:238] Iteration 20, loss = nan
I0201 18:00:21.564246 22329 solver.cpp:254] Train net output #0: loss_bbox = 0.0315711 ( 1 = 0.0315711 loss)
I0201 18:00:21.564251 22329 solver.cpp:254] Train net output #1: loss_cls = 0.850022 ( 1 = 0.850022 loss)
I0201 18:00:21.564256 22329 solver.cpp:254] Train net output #2: rpn_loss_bbox = 0.00812243 ( 1 = 0.00812243 loss)
I0201 18:00:21.564260 22329 solver.cpp:254] Train net output #3: rpn_loss_cls = 0.690506 ( 1 = 0.690506 loss)
I0201 18:00:21.564265 22329 sgd_solver.cpp:138] Iteration 20, lr = 0.0001 '

Then the loss immediately jumps to nan. I tried reducing the learning rate to as low as 0.000001, but the problem persists. Has anyone else faced this situation?
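
A note on the RuntimeWarning in the log above: it points at np.log(gt_heights / ex_heights), and a logarithm only turns into nan when its argument is negative. One plausible way to get there is a ground-truth box whose ymin is larger than its ymax, which gives a negative height. The sketch below is not the repository's code; height_target is a hypothetical helper that mimics the computation in plain Python, assuming boxes are (x1, y1, x2, y2) and heights are computed as y2 - y1 + 1.

```python
import math


def height_target(gt_box, ex_box):
    """Return log(gt_height / ex_height), or NaN when the ratio is not positive.

    Boxes are (x1, y1, x2, y2). NumPy's np.log returns nan (with a
    RuntimeWarning) for a negative argument, whereas math.log raises,
    so the NaN is produced explicitly here to mirror NumPy.
    """
    gt_h = gt_box[3] - gt_box[1] + 1.0
    ex_h = ex_box[3] - ex_box[1] + 1.0
    ratio = gt_h / ex_h
    return math.log(ratio) if ratio > 0 else float('nan')


# A well-formed ground-truth box and one with ymin > ymax.
good = height_target((10, 10, 50, 60), (12, 8, 48, 58))
bad = height_target((10, 60, 50, 10), (12, 8, 48, 58))
print(good, bad)
```

Once a single regression target is nan, the loss that sums over it is nan as well, and lowering the learning rate cannot help because the problem is in the data rather than in the optimisation.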

nynorbert commented 7 years ago


I have a similar problem with a custom dataset, but I don't know how to solve it yet. Have you made any progress with this problem?

In my case, rpn_loss_bbox is sometimes also "nan":

I0207 21:43:03.959089 12331 solver.cpp:229] Iteration 640, loss = nan
I0207 21:43:03.959159 12331 solver.cpp:245] Train net output #0: loss_bbox = 0 ( 1 = 0 loss)
I0207 21:43:03.959174 12331 solver.cpp:245] Train net output #1: loss_cls = 0 ( 1 = 0 loss)
I0207 21:43:03.959187 12331 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.653614 ( 1 = 0.653614 loss)
I0207 21:43:03.959197 12331 solver.cpp:245] Train net output #3: rpn_loss_bbox = nan ( 1 = nan loss)
I0207 21:43:03.959209 12331 sgd_solver.cpp:106] Iteration 640, lr = 0.0001
I0207 21:43:23.685860 12331 solver.cpp:229] Iteration 660, loss = nan
I0207 21:43:23.685931 12331 solver.cpp:245] Train net output #0: loss_bbox = 0.000342935 ( 1 = 0.000342935 loss)
I0207 21:43:23.685945 12331 solver.cpp:245] Train net output #1: loss_cls = 0.111643 ( 1 = 0.111643 loss)
I0207 21:43:23.685956 12331 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.809447 ( 1 = 0.809447 loss)
I0207 21:43:23.685966 12331 solver.cpp:245] Train net output #3: rpn_loss_bbox = 1.3496 ( 1 = 1.3496 loss)
I0207 21:43:23.685976 12331 sgd_solver.cpp:106] Iteration 660, lr = 0.0001
I0207 21:43:43.506032 12331 solver.cpp:229] Iteration 680, loss = nan
I0207 21:43:43.506108 12331 solver.cpp:245] Train net output #0: loss_bbox = 0 ( 1 = 0 loss)
I0207 21:43:43.506120 12331 solver.cpp:245] Train net output #1: loss_cls = 0 ( 1 = 0 loss)
I0207 21:43:43.506134 12331 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.733904 ( 1 = 0.733904 loss)
I0207 21:43:43.506145 12331 solver.cpp:245] Train net output #3: rpn_loss_bbox = nan ( 1 = nan loss)
I0207 21:43:43.506156 12331 sgd_solver.cpp:106] Iteration 680, lr = 0.0001
I0207 21:44:03.204167 12331 solver.cpp:229] Iteration 700, loss = nan
I0207 21:44:03.204233 12331 solver.cpp:245] Train net output #0: loss_bbox = 0 ( 1 = 0 loss)
I0207 21:44:03.204252 12331 solver.cpp:245] Train net output #1: loss_cls = 0 ( 1 = 0 loss)
I0207 21:44:03.204260 12331 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.795648 ( 1 = 0.795648 loss)
I0207 21:44:03.204272 12331 solver.cpp:245] Train net output #3: rpn_loss_bbox = nan ( 1 = nan loss)
I0207 21:44:03.204288 12331 sgd_solver.cpp:106] Iteration 700, lr = 0.0001
I0207 21:44:23.145709 12331 solver.cpp:229] Iteration 720, loss = 0.902763
I0207 21:44:23.145797 12331 solver.cpp:245] Train net output #0: loss_bbox = 2.89623e-05 ( 1 = 2.89623e-05 loss)
I0207 21:44:23.145810 12331 solver.cpp:245] Train net output #1: loss_cls = 0.0984289 ( 1 = 0.0984289 loss)
I0207 21:44:23.145823 12331 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.562016 ( 1 = 0.562016 loss)
I0207 21:44:23.145830 12331 solver.cpp:245] Train net output #3: rpn_loss_bbox = 0.27673 ( 1 = 0.27673 loss)

I haven't used any pretrained model, and my learning rate is 0.0001.

vj-1988 commented 7 years ago

@nynorbert: I haven't solved the issue. But if we use a pretrained model, I guess at least the mbox loss wouldn't be nan.

Can someone confirm whether we have to use a pretrained model that was also trained on our custom dataset classes, as we usually do in py-faster-rcnn?

MyVanitar commented 7 years ago

@vj-1988 @nynorbert

I don't have the answer to any of these questions, and it is very good that you got to this step without this information.

vj-1988 commented 7 years ago

@VanitarNordic : We don't create LMDB in pvanet training. I followed the steps in this page

MyVanitar commented 7 years ago

@vj-1988 if we have our own custom image dataset, then what should the model be trained on? I want to use PVANET+ (Compressed).

smg478 commented 7 years ago

I had this issue. I found one of my annotation .xml files was wrongly annotated (negative value in coordinates). I corrected the annotation and after that training was fine. No nan values anymore.
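
For anyone who wants to run the same check automatically, here is a minimal sketch of an annotation scan. It assumes Pascal VOC style XML with a size element and one bndbox per object; find_bad_boxes and the SAMPLE string are made up for illustration and are not part of this repository.

```python
import xml.etree.ElementTree as ET


def find_bad_boxes(xml_text):
    """Return (class_name, box, reason) for every suspicious box in one annotation."""
    root = ET.fromstring(xml_text)
    size = root.find('size')
    width = float(size.findtext('width'))
    height = float(size.findtext('height'))
    problems = []
    for obj in root.findall('object'):
        name = obj.findtext('name')
        bb = obj.find('bndbox')
        box = tuple(float(bb.findtext(tag))
                    for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
        xmin, ymin, xmax, ymax = box
        if xmin < 0 or ymin < 0:
            problems.append((name, box, 'negative coordinate'))
        elif xmax > width or ymax > height:
            problems.append((name, box, 'outside image'))
        elif xmin >= xmax or ymin >= ymax:
            problems.append((name, box, 'empty or inverted box'))
    return problems


# Made-up annotation: the first box has a negative xmin, the second is fine.
SAMPLE = """<annotation>
  <size><width>640</width><height>480</height></size>
  <object><name>cat</name><bndbox><xmin>-3</xmin><ymin>20</ymin><xmax>100</xmax><ymax>200</ymax></bndbox></object>
  <object><name>dog</name><bndbox><xmin>50</xmin><ymin>60</ymin><xmax>300</xmax><ymax>400</ymax></bndbox></object>
</annotation>"""

for problem in find_bad_boxes(SAMPLE):
    print(problem)
```

To scan a real dataset, read each file in the Annotations folder and pass its contents to the function; any file that produces a non-empty list is worth opening by hand.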

vj-1988 commented 7 years ago

@smg478 : Thank you. I will check my annotations once again and update here on the status of the training.

MyVanitar commented 7 years ago


Do you have any idea how to make an IMDB file for a custom image dataset? It is one of the input training parameters here:

vj-1988 commented 7 years ago

There is an explanation here regarding IMDB creation for custom datasets.

MyVanitar commented 7 years ago


Thanks, I read that briefly. One question pops up. Here we see that the input IMDB file holds information for both training and validation in one file, as it is named voc_2007_trainval.

Does this create the same file?

vj-1988 commented 7 years ago

For VOC dataset, train.txt, val.txt and trainval.txt are already given along with the dataset.


So we have to generate trainval.txt for our custom dataset and use it. It contains the filenames of the train and val images in the JPEGImages folder, without the file extension.
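
A minimal sketch of generating such a file, assuming a VOC-like layout where the images sit in a JPEGImages folder. write_image_set is a hypothetical helper, and the demo uses a throwaway temporary directory instead of a real dataset.

```python
import os
import tempfile


def write_image_set(jpeg_dir, out_path):
    """Write one image name per line (no extension) and return the names."""
    names = sorted(
        os.path.splitext(fname)[0]
        for fname in os.listdir(jpeg_dir)
        if fname.lower().endswith(('.jpg', '.jpeg'))
    )
    with open(out_path, 'w') as f:
        for name in names:
            f.write(name + '\n')
    return names


# Demo on a throwaway directory standing in for <dataset>/JPEGImages.
demo_dir = tempfile.mkdtemp()
for fname in ('000002.jpg', '000001.jpg', 'notes.txt'):
    open(os.path.join(demo_dir, fname), 'w').close()
out_file = os.path.join(demo_dir, 'trainval.txt')
names = write_image_set(demo_dir, out_file)
print(names)
```

In a real dataset the output path would be ImageSets/Main/trainval.txt. This sketch lists every image it finds, so a separate train/val split still has to be made by hand if one is needed.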

vj-1988 commented 7 years ago

@smg478 : Thank you. I am training a new custom dataset which was carefully annotated and I am not getting Nan for loss anymore.

charlesYMX commented 6 years ago

The solution for me was to rewrite gt_roidb() and _load_pascal_annotation() (these two functions are at


The training data in my XML files was not right, so I was feeding wrong numbers into the training process.

Here is the code I wrote (only the parts I changed, not the whole file):


    # Presumably inside gt_roidb(): drop every image whose annotation fails to load.
    index = 0        # assumed initialisation, not shown in the original post
    removeCount = 0  # assumed initialisation, not shown in the original post
    while index < len(self.image_index):
        annotation = self._load_pascal_annotation(self.image_index[index])
        if annotation is not None:
            index += 1
        else:
            removeCount += 1
            del self.image_index[index]
    print('removed count: {} {}'.format(removeCount, '-' * 30))
    print('image number after remove: {} {}'.format(len(self.image_index), '-' * 30))
    image_set_file = os.path.join(self._data_path, 'ImageSets', 'Main',
                                  self._image_set + '.txt')
    # Rewrite the image index txt file so it lists only the remaining images.
    with open(image_set_file, 'w') as f:
        for val_item in self.image_index:
            # Assumed loop body (cut off in the original post): one name per line.
            f.write(val_item + '\n')


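    # Presumably inside _load_pascal_annotation(); img_width and img_height are
    # assumed to have been read from the image or the <size> element earlier.
    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # Read the coordinates as given (no "- 1" shift to 0-based indexes)
        x1 = float(bbox.find('xmin').text)
        y1 = float(bbox.find('ymin').text)
        x2 = float(bbox.find('xmax').text)
        y2 = float(bbox.find('ymax').text)
        # Swap inverted corners
        if x1 > x2:
            x1, x2 = x2, x1
        if y1 > y2:
            y1, y2 = y2, y1
        # Clamp to the image; the last valid pixel index is size - 1, otherwise
        # a box touching the right edge becomes -1 when the image is flipped
        x1 = max(x1, 0)
        y1 = max(y1, 0)
        x2 = min(x2, img_width - 1)
        y2 = min(y2, img_height - 1)

        # Report boxes that are still empty after the clean-up
        if x1 >= x2 or y1 >= y2:
            print('%s, %d, %d, %d, %d' % (index, x1, y1, x2, y2))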