jwyang / faster-rcnn.pytorch

A faster pytorch implementation of faster r-cnn
MIT License

NaN loss when I was training resnet101 on Pascal Voc 2007 #236

Open xuw080 opened 6 years ago

xuw080 commented 6 years ago

I used the vanilla code and prepared the Pascal VOC 2007 dataset as suggested. I set the learning rate to 0.01 and trained with 8 GPUs and a batch size of 24. However, the rcnn_cls loss becomes NaN after only 2 iterations. My PyTorch version is 0.3.0.post4. I also tried training resnet50 and hit a similar problem. Even with 1 GPU, batch size 1, and lr=0.001, the problem still appears. VGG16 works fine, though. I am not sure what happened; can anyone provide some suggestions?

XudongWang12Sigma commented 6 years ago

Recently I found that if I do not fix the BN layers, I do not get NaN loss, but the final results are worse. Once I fix the BN layers as in the original code, the NaN loss appears. Does anyone have any suggestions? @jwyang
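
For context, a minimal sketch of what "fixing" the BN layers usually means here (freezing their affine parameters and keeping them in eval mode); the helper names and the tiny stand-in backbone below are illustrative, not copied from the repo:

```python
import torch.nn as nn

def set_bn_fix(m):
    # Freeze the affine parameters of every BatchNorm layer.
    if isinstance(m, nn.BatchNorm2d):
        for p in m.parameters():
            p.requires_grad = False

def set_bn_eval(m):
    # Keep BatchNorm in eval mode so running statistics are not updated.
    if isinstance(m, nn.BatchNorm2d):
        m.eval()

# A stand-in for the ResNet-101 backbone; apply the same calls to the real one.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
backbone.apply(set_bn_fix)   # freeze gamma/beta
backbone.train()             # .train() would normally re-enable BN updates...
backbone.apply(set_bn_eval)  # ...so force BN back into eval mode afterwards
```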

ljtruong commented 6 years ago

@xuw080 have you checked the annotations?

Refer to this issue: https://github.com/jwyang/faster-rcnn.pytorch/issues/136#issuecomment-390544655

xuw080 commented 6 years ago

Thank you so much for your reply. I have checked the annotations. Since it is the original Pascal VOC dataset and I did not modify any lines of code, the annotations should be fine. Also, if I use vgg16 as the backbone, the problem does not appear and the results match the reported ones.

jwyang commented 6 years ago

@xuw080 it might happen due to a new version of pytorch. Try applying some gradient clipping for resnet101 at this line.
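
For what it's worth, a minimal self-contained sketch of that kind of gradient clipping; the dummy model stands in for the Faster R-CNN network, and the max_norm value of 10 is an assumption, not a value taken from the repo:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                       # stand-in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()                 # stand-in for the summed RCNN losses

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm before the update to keep updates bounded.
# PyTorch 0.4+ names this clip_grad_norm_; 0.3 uses clip_grad_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```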

xuw080 commented 6 years ago

Could you please tell me which pytorch version you use to run your code? I am using 0.3.0.post4, and it leads to NaN errors.

ljtruong commented 6 years ago

@xuw080 I'm on pytorch 0.4 with resnet. After making those adjustments to my dataset I was able to remove the NaN errors.

JingXiaolun commented 6 years ago

@Worulz, what adjustments did you make? I need to remove the NaN errors too.

ljtruong commented 6 years ago

@1csu if you're receiving NaN errors, it's most likely to do with your annotations. Refer to this comment to fix it: https://github.com/jwyang/faster-rcnn.pytorch/issues/136#issuecomment-390544655

Even when using the default settings for Pascal VOC I received NaN results. I would check the data loading or, as stated, check your annotations (see the sketch below).
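
As a rough illustration of that kind of check, a standalone sketch that scans VOC-style annotation XML files for degenerate or out-of-range boxes; the annotation directory path is a placeholder:

```python
import glob
import os
import xml.etree.ElementTree as ET

ANNOT_DIR = "data/VOCdevkit2007/VOC2007/Annotations"   # placeholder path

for xml_path in glob.glob(os.path.join(ANNOT_DIR, "*.xml")):
    root = ET.parse(xml_path).getroot()
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        x1 = float(bbox.find("xmin").text)
        y1 = float(bbox.find("ymin").text)
        x2 = float(bbox.find("xmax").text)
        y2 = float(bbox.find("ymax").text)
        # Flag degenerate or out-of-bounds boxes that can produce NaN losses.
        if x1 < 1 or y1 < 1 or x2 > width or y2 > height or x2 <= x1 or y2 <= y1:
            print(f"{os.path.basename(xml_path)}: bad box ({x1}, {y1}, {x2}, {y2})")
```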

ly19965 commented 6 years ago

@xuw080 I have the same problem as yours. If you change lr from 1e-3 to 1e-4, the loss is no longer NaN, but the mAP is only 56. Besides, if you delete the batchnorm layers in lib/model/faster_rcnn/resnet.py, lr = 1e-3 trains without NaN, but the mAP is also bad, almost 30. Therefore, the problem is in the batchnorm layers. If you figure it out, please send an email to 987752424@qq.com, thanks.

XudongWang12Sigma commented 6 years ago

If you don't fix the BN layers, you won't get NaN loss. However, that is not what the original code intends to do; generally, for training on detection datasets, the BN layers should be fixed. I also get results similar to yours if I fix the BN layers as in the original code and use a lower learning rate: the mAP is about 60, which is much lower than the reported results. @jwyang @ly19965

XudongWang12Sigma commented 6 years ago

Hi Leon, did you try to train Pascal VOC with this code? @Worulz I still get NaN loss when training on Pascal VOC 0712. I have checked the data loading code and didn't find any bugs. Thank you in advance.

jwyang commented 6 years ago

@XudongWang12Sigma have you solved your problem?

Hackerlil commented 5 years ago

Maybe I have solved this problem: in pascal_voc.py, in the function _load_pascal_annotation, comment out all of the "- 1" offsets, then delete everything in data/cache, and then it works.
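
A sketch of what that change amounts to, written as a standalone loader; `subtract_one=True` mirrors the original "- 1" behaviour in `_load_pascal_annotation`, while dropping the offset avoids negative coordinates when annotations are already 0-based (the function name here is illustrative):

```python
import xml.etree.ElementTree as ET

def load_boxes(xml_path, subtract_one=False):
    """Read VOC-style boxes from one annotation file.

    subtract_one=True reproduces the original "- 1" conversion from VOC's
    1-based pixel indices to 0-based; with 0-based annotations that offset
    yields negative coordinates, which is a common source of NaN losses.
    """
    offset = 1.0 if subtract_one else 0.0
    boxes = []
    for obj in ET.parse(xml_path).getroot().findall("object"):
        bbox = obj.find("bndbox")
        boxes.append((
            float(bbox.find("xmin").text) - offset,
            float(bbox.find("ymin").text) - offset,
            float(bbox.find("xmax").text) - offset,
            float(bbox.find("ymax").text) - offset,
        ))
    return boxes
```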

Hackerlil commented 5 years ago

You also need to add some code to check the coordinates; printing them will show whether they are right.
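
Along those lines, a small sketch of such a coordinate check that prints any invalid boxes; the function name and example image id are illustrative:

```python
import numpy as np

def check_boxes(boxes, image_id="unknown"):
    """Print and flag boxes whose coordinates are invalid (negative,
    zero-area, or reversed), a common cause of NaN losses."""
    boxes = np.asarray(boxes, dtype=np.float32)
    bad = (
        (boxes[:, 0] < 0) | (boxes[:, 1] < 0)
        | (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])
    )
    if bad.any():
        print(f"image {image_id}: invalid boxes\n{boxes[bad]}")
    return not bad.any()

# Example: the second box has x2 < x1 and will be reported.
check_boxes([[10, 10, 50, 50], [30, 30, 20, 60]], image_id="000005")
```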