Closed AA12321 closed 5 years ago
This looks a lot like a duplicate of https://github.com/pytorch/vision/issues/1128, https://github.com/pytorch/vision/issues/1120 and https://github.com/pytorch/vision/issues/997
Could you check whether the solutions in those issues apply to your problem? That is, check whether you have a box with zero width or height somewhere.
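A minimal sketch of such a check, assuming boxes are given as [x1, y1, x2, y2] (the format torchvision detection models expect). The helper name `find_degenerate_boxes` and the sample boxes are hypothetical and not part of the reference scripts; a tensor of boxes can be passed in via `.tolist()`.

```python
def find_degenerate_boxes(boxes, min_size=0.0):
    """Return indices of boxes whose width or height is <= min_size.

    Assumes each box is [x1, y1, x2, y2] (hypothetical helper, for illustration).
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        width = x2 - x1
        height = y2 - y1
        if width <= min_size or height <= min_size:
            bad.append(i)
    return bad


# Made-up example boxes, not taken from COCO.
boxes = [
    [10.0, 20.0, 50.0, 80.0],  # valid box
    [30.0, 40.0, 30.0, 90.0],  # zero width
    [5.0, 5.0, 25.0, 5.0],     # zero height
]
print(find_degenerate_boxes(boxes))  # [1, 2]
```

Running something like this over every target in the dataset before training should show whether any degenerate box reaches the model.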
Closing due to inactivity and because there is nothing actionable we can do.
I'm trying to fine-tune the fasterrcnn_resnet50_fpn model using the scripts in vision/references/detection/.
As a sanity check, I first trained the model with the train.py script using all default parameters on COCO2017, except for --pretrained -b 1 -j 4 and --device cuda (on a GTX 1070). I get the following error:
{'loss_classifier': tensor(nan, device='cuda:0', grad_fn=<NllLossBackward>), 'loss_box_reg': tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), 'loss_rpn_box_reg': tensor(inf, device='cuda:0', grad_fn=<DivBackward0>)}
I ran another test with the learning rate reduced to 0.0001, but the same error was raised, only later in training. To find the cause of the problem, I ran the same test without shuffling the training samples, and surprisingly there isn't a specific sample with a wrong bbox (the dataset used is the original COCO2017, and the scripts automatically exclude wrong samples).
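One way to narrow this down is to stop at the first iteration where any loss becomes non-finite and record which loss terms are affected. The following is only a sketch: `nonfinite_losses` is a hypothetical helper, and the values are made up to mirror the output above. `float()` also accepts a zero-dimensional tensor, so the same check should work on the loss dict returned by the model.

```python
import math


def nonfinite_losses(loss_dict):
    """Return the names of losses that are NaN or infinite (hypothetical helper)."""
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Made-up values for illustration only.
losses = {
    "loss_classifier": float("nan"),
    "loss_box_reg": 0.12,
    "loss_objectness": 0.03,
    "loss_rpn_box_reg": float("inf"),
}

bad = nonfinite_losses(losses)
if bad:
    # In a training loop, this is where one would log the current
    # image ids and targets, then stop.
    print("non-finite losses:", bad)
```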
Configuration: CUDA 9.1 and cuDNN 7 on Ubuntu 18.04 LTS