google-research / ssl_detection

Semi-supervised learning for object detection
Apache License 2.0

the total_cost and wd_cost become nan. #9

Closed bobzhang123 closed 4 years ago

bobzhang123 commented 4 years ago

Hi, when I test your code with train_stg1.sh to train the teacher model, the logs show that total_cost and wd_cost become NaN. I did not change any code. The dataset and GPUs are as follows: DATASET='coco_train2017.1@10', CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. [screenshot]

zizhaozhang commented 4 years ago

There is a rare case where a single forward pass can make the loss NaN, and once it is accumulated into the moving averages of the total_cost and wd_cost variables, the stdout stays NaN. We haven't fixed this internal issue, but it does not affect training.
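For illustration, here is a minimal sketch (not the repo's actual code; the losses and decay factor are made up) of how a single NaN loss poisons an exponential moving average such as the ones printed for total_cost and wd_cost: once a NaN enters, every later average stays NaN even if subsequent losses are finite.

```python
import numpy as np

# Hypothetical per-step losses: one bad forward pass produces NaN.
losses = [0.81, 0.79, float("nan"), 0.76, 0.74]

ema = 0.0
decay = 0.9  # assumed smoothing factor, for illustration only
for step, loss in enumerate(losses, start=1):
    # Exponential moving average of the loss, as used for printed summaries.
    ema = decay * ema + (1.0 - decay) * loss
    print(f"step {step}: loss={loss}, moving average={ema}")

# After step 3 the moving average is NaN and never recovers, which is why
# the stdout keeps showing nan for total_cost / wd_cost.
```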

bobzhang123 commented 4 years ago

There is a rare case where a single forward pass can make the loss NaN, and once it is accumulated into the moving averages of the total_cost and wd_cost variables, the stdout stays NaN. We haven't fixed this internal issue, but it does not affect training.

Thank you for your reply. In my case, it actually has a bad effect on my results. After the model is trained, when I run 'eval_stg1.sh', the results (e.g. mAP, AP50, ...) are all zero.

zizhaozhang commented 4 years ago

From your screenshot, the recall results seem correct, so I wonder whether the problem is something other than the NaN. All the scripts we provide were tested. Can you check whether, during training, the COCO eval metrics in TensorBoard are correct?

If not, could you please list the exact commands you run? I cannot investigate based on the current information.

bobzhang123 commented 4 years ago

From your screenshot, the recall results seem correct, so I wonder whether the problem is something other than the NaN. All the scripts we provide were tested. Can you check whether, during training, the COCO eval metrics in TensorBoard are correct?

If not, could you please list the exact commands you run? I cannot investigate based on the current information.

Hi, I tried training again. This is my training script in 'train_stg1.sh': [screenshot]. The 25th epoch: [screenshot]. The 26th epoch: [screenshot]. The 27th epoch: [screenshot].

The total_cost and wd_cost diverged in the 26th epoch. The evaluation result after 40 epochs is as follows: [screenshot]

I also changed the learning rate from 1e-2 to 1e-3 and still hit the same problem.
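One way to check whether the divergence actually corrupted the trained weights (which would explain the all-zero mAP from eval_stg1.sh) is to scan the saved checkpoint for NaN/Inf values. A rough sketch, assuming a TF 1.x checkpoint; the log directory is a placeholder and may differ from what the repo's scripts use:

```python
import numpy as np
import tensorflow as tf

# Placeholder path: point this at the training log directory used by train_stg1.sh.
ckpt = tf.train.latest_checkpoint("./train_log/stg1")
reader = tf.train.load_checkpoint(ckpt)

bad = []
for name in reader.get_variable_to_shape_map():
    value = np.asarray(reader.get_tensor(name))
    # Only floating-point variables can hold NaN/Inf.
    if np.issubdtype(value.dtype, np.floating) and not np.all(np.isfinite(value)):
        bad.append(name)

print("variables containing NaN/Inf:", bad or "none")
```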

zizhaozhang commented 4 years ago

Thanks. We haven't met this issue in our training, and other users have not reported it either. I would suggest checking:

  1. whether your TensorFlow version (1.14) meets our listed requirement;
  2. whether you followed the correct data preparation.

There is no need to tune parameters to avoid this issue; the default parameters should work.
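As a quick sanity check for point 1, something like the following can confirm the TensorFlow and Python versions before launching train_stg1.sh. This is a minimal sketch; the TF 1.14 / Python 3.6 values are taken from this thread, and the repository's requirements file is the authoritative reference:

```python
import sys
import tensorflow as tf

print("python:", sys.version.split()[0])
print("tensorflow:", tf.__version__)

# Versions below reflect what is discussed in this thread (TF 1.14, Python 3.6);
# check the repository's listed requirements for the exact pins.
assert tf.__version__.startswith("1.14"), "unexpected TensorFlow version"
assert sys.version_info[:2] == (3, 6), "unexpected Python version"
```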

bobzhang123 commented 4 years ago

Thanks. We haven't met this issue in our training, and other users have not reported it either. I would suggest checking:

  1. whether your TensorFlow version (1.14) meets our listed requirement;
  2. whether you followed the correct data preparation.

There is no need to tune parameters to avoid this issue; the default parameters should work.

Hi, I have solved the problem. I had been using the branch 'dependabot/pip/tensorflow-gpu-1.15.2' with TF 1.15.2. Now I use the master branch, downgraded TensorFlow to 1.14.0 and Python to 3.6.8, and both the NaN problem and the OOM problem are solved. It seems the 'dependabot/pip/tensorflow-gpu-1.15.2' branch is not robust and still has some bugs.