jwyang / faster-rcnn.pytorch

A faster pytorch implementation of faster r-cnn

NaN loss when calculating smooth_l1_loss #554

Open · andreydung opened this issue 5 years ago

andreydung commented 5 years ago

I turned on torch.autograd.set_detect_anomaly and found that the NaN loss happens during the backward pass of smooth_l1_loss:

rois_label, adja_loss, adjr_loss = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/data/hkrm_training/lib/model/HKRM/faster_rcnn_HKRM.py", line 272, in forward
    RCNN_loss_bbox = _smooth_l1_loss(bbox_pred, rois_target, rois_inside_ws, rois_outside_ws)
  File "/home/ubuntu/data/hkrm_training/lib/model/utils/net_utils.py", line 101, in _smooth_l1_loss
    in_loss_box = torch.pow(in_box_diff, 2) * (sigma_2 / 2.) * smoothL1_sign \

Traceback (most recent call last):
  File "trainval_HKRM.py", line 370, in <module>
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

I'm wondering if there is a way to fix this problem?
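
For reference, a tiny self-contained sketch (not the repo's code; the NaN input is fabricated purely to trigger the same error) of how set_detect_anomaly surfaces the PowBackward0 failure:

```python
import torch

# With anomaly detection enabled, every op records the stack trace of its
# forward call, and backward() raises at the first function whose gradient
# contains NaN -- which is how the traceback above pinpoints _smooth_l1_loss.
torch.autograd.set_detect_anomaly(True)

# A value that is already NaN going into torch.pow reproduces the error:
x = torch.tensor([1.0, float('nan')], requires_grad=True)
loss = torch.pow(x, 2).sum()  # the forward pass silently propagates the NaN
loss.backward()  # RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
```

In the actual script it is enough to call torch.autograd.set_detect_anomaly(True) once before the training loop; anomaly mode slows training down considerably, so switch it off again once the offending op has been located.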

IceSuger commented 5 years ago

I turned on torch.autograd.set_detect_anomaly and found that the NaN loss happens during the backward pass of smooth_l1_loss: [...] I'm wondering if there is a way to fix this problem?

I've met the same problem, did you solve that?

hezhu1996 commented 4 years ago

@andreydung @IceSuger I have the same problem. Have you guys solved it? Thanks!

IceSuger commented 4 years ago

Sorry, I can't recall the details, but as far as I remember the NaN was caused by some dirty data in my dataset. Hope that helps.
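
For anyone who wants to run the same kind of check, here is a minimal sketch of a sanity scan for dirty ground-truth boxes; it assumes the roidb layout this repo builds (a list of dicts with 'boxes' as [x1, y1, x2, y2] arrays plus 'width' and 'height'), so double-check the keys against your own loader:

```python
import numpy as np

def find_dirty_entries(roidb):
    """Flag annotations that commonly lead to NaN box-regression targets."""
    bad = []
    for i, entry in enumerate(roidb):
        boxes = np.asarray(entry['boxes'], dtype=np.float64)
        if boxes.size == 0:
            bad.append((i, 'no boxes'))
            continue
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        if not np.all(np.isfinite(boxes)):
            bad.append((i, 'non-finite coordinates'))
        # Inverted boxes give non-positive widths/heights and hence log(<=0)
        # when regression targets are computed. Depending on whether your
        # coordinates are inclusive, you may also want to flag x2 == x1.
        if np.any(x2 < x1) or np.any(y2 < y1):
            bad.append((i, 'inverted box (x2 < x1 or y2 < y1)'))
        if np.any(x1 < 0) or np.any(y1 < 0) or \
                np.any(x2 > entry['width']) or np.any(y2 > entry['height']):
            bad.append((i, 'box outside the image'))
    return bad

# for idx, reason in find_dirty_entries(roidb):
#     print(idx, reason)
```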

hezhu1996 commented 4 years ago

Hi, did you use the original COCO data? I'm asking because I use the official COCO dataset...
Also, can you tell me how to locate the dirty data? The first epoch trains fine, but somewhere in the second epoch the loss becomes NaN... thanks :)

IceSuger commented 4 years ago

I didn't use the COCO dataset, and my training failed in the first epoch... so I guess your NaN may not come from dirty data. Maybe consider numerical underflow or overflow? Try checking your network's output tensors and see whether all the values are within a reasonable range.
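
If it helps, one way to do that check systematically is to hang forward hooks on every module and stop at the first one whose output contains NaN or Inf. A minimal sketch (add_nan_hooks is just a debugging helper written here, not part of the repo, and fasterRCNN below stands for your own model instance):

```python
import torch

def add_nan_hooks(model):
    """Register forward hooks that raise as soon as any module's output
    contains non-finite values, so the first offending layer is named."""
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and t.is_floating_point() \
                        and not torch.isfinite(t).all():
                    raise RuntimeError('non-finite values in the output of ' + name)
        return hook
    # Keep the handles so the hooks can be removed after debugging.
    return [m.register_forward_hook(make_hook(name))
            for name, m in model.named_modules()]

# Usage:
# handles = add_nan_hooks(fasterRCNN)
# ...run one training step; the first module emitting NaN/Inf raises...
# for h in handles:
#     h.remove()
```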

hezhu1996 commented 4 years ago

Oh... OK, thanks anyway.
