matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Training on a combined COCO and Visual Genome dataset gives NaN loss. #2294

Open Askfk opened 4 years ago

Askfk commented 4 years ago

Hello, I have recently been building a network that produces both masks and bounding-box-level captions. I referred to Mask R-CNN and DenseCap, which are both built on top of Faster R-CNN. I removed the bounding box and class ID parts for single objects and added bounding box and caption parts for bounding-box-level captioning.

To train the model, I use two RPN models with exactly the same architecture, one for masking and one for captioning. So from the two RPNs I get four losses: mask_rpn_score_loss, mask_rpn_bbox_loss, caption_rpn_score_loss, and caption_rpn_bbox_loss. All of the pieces (data generator, RPN architecture, loss functions) are designed the same way as in Mask R-CNN.
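For reference, the bbox part of each RPN loss follows Mask R-CNN's rpn_bbox_loss_graph. Below is a simplified, self-contained sketch of that style of loss; it assumes the target deltas are already aligned one-to-one with the anchors (the real Mask R-CNN code packs the positive targets per image with batch_pack_graph), so treat it as illustrative rather than a drop-in replacement:

```python
import tensorflow as tf
import keras.backend as K

def smooth_l1_loss(y_true, y_pred):
    """Smooth-L1 loss, as used by Mask R-CNN for bbox regression."""
    diff = K.abs(y_true - y_pred)
    less_than_one = K.cast(K.less(diff, 1.0), "float32")
    return less_than_one * 0.5 * diff ** 2 + (1.0 - less_than_one) * (diff - 0.5)

def rpn_bbox_loss(target_deltas, rpn_match, pred_deltas):
    """Bbox refinement loss over positive anchors only.

    target_deltas: [batch, anchors, 4] ground-truth deltas (assumed aligned
                   with pred_deltas here, unlike Mask R-CNN's packed layout).
    rpn_match:     [batch, anchors, 1] with 1 = positive, -1 = negative, 0 = neutral.
    pred_deltas:   [batch, anchors, 4] predicted deltas.
    """
    rpn_match = K.squeeze(rpn_match, -1)
    positive_indices = tf.where(K.equal(rpn_match, 1))
    pred = tf.gather_nd(pred_deltas, positive_indices)
    target = tf.gather_nd(target_deltas, positive_indices)
    loss = smooth_l1_loss(target, pred)
    # If a batch has no positive anchors, the mean over an empty tensor is NaN,
    # so that case is guarded explicitly (Mask R-CNN does the same).
    return K.switch(tf.size(loss) > 0, K.mean(loss), tf.constant(0.0))
```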

When I train the whole model with the Adam optimizer and an initial learning rate of 0.001, I find that after several training steps caption_rpn_bbox_loss suddenly becomes NaN, while the other losses keep decreasing normally. I cannot figure out what causes this.

As far as I can tell there is no mistake in how rpn_gt_bbox, rpn_gt_scores, pred_rpn_bbox and pred_rpn_scores are computed. So my guess was that there is dirty data in my dataset, e.g. zero-size bounding boxes, but after checking I still cannot find any evidence of that.
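This is the kind of check I ran to look for degenerate boxes; a minimal NumPy sketch that assumes annotations stored as [y1, x1, y2, x2] in pixel coordinates (adjust if your boxes use [x, y, w, h]):

```python
import numpy as np

def find_degenerate_boxes(boxes, min_size=1.0):
    """Return indices of boxes with non-positive or tiny height/width.
    `boxes` is an [N, 4] array in [y1, x1, y2, x2] order (assumption)."""
    boxes = np.asarray(boxes, dtype=np.float32)
    heights = boxes[:, 2] - boxes[:, 0]
    widths = boxes[:, 3] - boxes[:, 1]
    return np.where((heights < min_size) | (widths < min_size))[0]

# Usage over a dict of {image_id: boxes} annotations (hypothetical structure):
# for image_id, boxes in annotations.items():
#     bad = find_degenerate_boxes(boxes)
#     if len(bad):
#         print(image_id, "has degenerate boxes at indices", bad)
```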

Below is the notebook (ipynb) I used to combine COCO with the Visual Genome dataset and filter out invalid data.

inspect_dataset.ipynb

Could anyone give me advice on why this NaN loss occurs or how to fix it?

Askfk commented 4 years ago

I also tried the Nadam, Adagrad, RMSProp and SGD optimizers. With every optimizer except SGD, caption_rpn_bbox_loss still suddenly became NaN after several training steps, while SGD got stuck in a local minimum and could not reach acceptable training performance.

Askfk commented 4 years ago

The problem above was solved once I reduced the learning rate to below 0.0001.
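Concretely, this just means recompiling with a smaller learning rate, roughly as in the sketch below; `model` stands for the combined Keras model with its losses already attached via add_loss, and clipnorm=5.0 mirrors Mask R-CNN's GRADIENT_CLIP_NORM and is only an optional extra safeguard:

```python
from keras.optimizers import Adam

# Lower learning rate to avoid the NaN in caption_rpn_bbox_loss.
# `model` is assumed to be the full Keras model with losses added via add_loss.
optimizer = Adam(lr=5e-5, clipnorm=5.0)
model.compile(optimizer=optimizer, loss=[None] * len(model.outputs))
```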

Previously, I took both positive and negative samples into account when computing the final box loss and the final caption loss. After removing the negative samples, I get NaN on the caption and box losses no matter which optimizer I use and no matter how low the learning rate is.

I still cannot find what causes the NaN loss, but I added

loss = K.switch(tf.math.is_nan(loss), tf.constant([0.0]), tf.reduce_mean(loss))

to the loss function right before the loss is returned, so that an invalid value is never fed back to the model. The other losses can use the same guard to bypass NaN or otherwise unexpected values.
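One detail: in the line above, tf.math.is_nan is applied to the un-reduced, per-element loss, so a slightly cleaner variant reduces to a scalar first and then checks that scalar. A minimal sketch (nan_safe_mean is just an illustrative name):

```python
import tensorflow as tf
import keras.backend as K

def nan_safe_mean(loss):
    """Reduce a per-element loss tensor to its scalar mean, substituting 0.0
    when the mean comes out as NaN so the bad value never reaches the optimizer."""
    mean_loss = tf.reduce_mean(loss)
    return K.switch(tf.math.is_nan(mean_loss), tf.constant(0.0), mean_loss)
```

Note that zeroing the loss also zeroes its gradient, so the affected head simply skips learning on that batch rather than being fixed.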

I still have no idea how this change affects the training performance.

ahpu2014 commented 3 years ago

Hi, have you managed to resolve this problem?