Loss is nan - Githubissues

AA12321 commented 5 years ago

I'm trying to fine-tune the fasterrcnn_resnet50_fpn model using the scripts in vision/references/detection/.

As a sanity check, I first trained the model with the train.py script on COCO2017, using all default parameters except --pretrained, -b 1, -j 4 and --device cuda (on a GTX 1070).

I get the following error: {'loss_classifier': tensor(nan, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(inf, device='cuda:0', grad_fn=<DivBackward0>)}

In another test I reduced the learning rate to 0.0001, but the same error was raised, only after more time. To find the cause of the problem, I ran the same test without shuffling the training samples, and surprisingly there isn't a specific sample with a wrong bbox (the dataset used is the original COCO2017, and the scripts automatically exclude wrong samples).
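To pin down the first iteration at which a loss term stops being finite, a check along the following lines could be run on the loss dict before the backward pass. This is only a minimal sketch: `non_finite_losses` is a hypothetical helper name, not part of the reference scripts, and the example values are made up for illustration.

```python
import math


def non_finite_losses(loss_dict):
    """Return the names of loss terms whose value is NaN or +/-inf.

    Hypothetical helper. Values may be plain floats or single-element
    tensors, since float() accepts both.
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Illustrative values only, shaped like the loss dict quoted above.
example = {
    "loss_classifier": float("nan"),
    "loss_box_reg": 0.21,
    "loss_objectness": 0.05,
    "loss_rpn_box_reg": float("inf"),
}
print(non_finite_losses(example))  # ['loss_classifier', 'loss_rpn_box_reg']
```

Inside the training loop, a non-empty result could be used to print the current targets and stop, so that the offending batch can be inspected with shuffling disabled.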

Configuration: CUDA 9.1 and cuDNN 7 on Ubuntu 18.04 LTS

fmassa commented 5 years ago

This looks a lot like a duplicate of https://github.com/pytorch/vision/issues/1128, https://github.com/pytorch/vision/issues/1120 and https://github.com/pytorch/vision/issues/997

Could you check whether the solutions in those issues apply to your problem? I.e., check whether you have a box with zero width / height somewhere?
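A check for such boxes could look roughly like the sketch below. It assumes boxes are already in the [x1, y1, x2, y2] format that the torchvision detection models expect (raw COCO annotations are [x, y, width, height], so they would need converting first). `find_degenerate_boxes` and `min_size` are hypothetical names chosen for illustration.

```python
import math


def find_degenerate_boxes(boxes, min_size=0.0):
    """Return indices of boxes that are non-finite or too small.

    Hypothetical helper. `boxes` is any iterable of [x1, y1, x2, y2]
    rows, e.g. target["boxes"].tolist() for a tensor of boxes.
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        finite = all(math.isfinite(v) for v in (x1, y1, x2, y2))
        if not finite or (x2 - x1) <= min_size or (y2 - y1) <= min_size:
            bad.append(i)
    return bad


# Illustrative boxes only.
boxes = [
    [10.0, 20.0, 50.0, 80.0],  # valid
    [30.0, 40.0, 30.0, 90.0],  # zero width
    [5.0, 60.0, 25.0, 60.0],   # zero height
]
print(find_degenerate_boxes(boxes))  # [1, 2]
```

Running this over every target in the dataset before training would show whether any zero-width or zero-height box survives the dataset filtering.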

fmassa commented 5 years ago

Closing due to inactivity and because there is nothing actionable we can do.

pytorch / vision

Loss is nan #1176