Closed · bobzhang123 closed this issue 4 years ago
There is a rare case where a single forward pass can make the loss NaN, and when it is accumulated into the moving averages of the total_cost and wd_cost variables, it makes the stdout show NaN. We haven't fixed this internal issue, but it does not affect training.
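To make this concrete: once a single NaN enters the moving average, every later value of that average is NaN as well, which is why the stdout keeps printing NaN afterwards. A toy illustration in plain Python (the decay and the loss values here are made up for the example, not taken from our code):

```python
# Toy example: one NaN loss makes an exponential moving average NaN forever.
import math

def update_ema(ema, value, decay=0.9):
    return decay * ema + (1.0 - decay) * value

losses = [0.52, 0.49, float("nan"), 0.47, 0.45]  # one bad forward pass in the middle
ema = losses[0]
for loss in losses[1:]:
    ema = update_ema(ema, loss)
    print("loss = %s  ->  ema = %s" % (loss, ema))

print("EMA stays NaN after the bad step:", math.isnan(ema))
```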
Thank you for your reply. In my case, it actually has a bad effect on my results. After the model is trained, when I run 'eval_stg1.sh', the results (e.g. mAP, AP50, ...) all become zero.
From your screenshot, the recall results seem correct, so I am wondering whether the problem is caused by something other than the NaN. All the scripts we provided were tested before. Can you check whether, during training, the COCO eval metrics in TensorBoard are correct?
If not, could you please list all the exact commands you ran? I cannot investigate based on the current information.
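If it helps, here is a rough way to dump the logged scalars outside of TensorBoard and see whether the eval metrics themselves went NaN (just a sketch; the event-file path is a placeholder you would point at your own train_log directory):

```python
# Sketch: print scalar summaries from a TensorBoard event file (TF 1.x API).
import math
import tensorflow as tf

event_file = "train_log/events.out.tfevents.XXXX"  # placeholder path, adjust to your log dir

for event in tf.train.summary_iterator(event_file):
    for value in event.summary.value:
        if value.HasField("simple_value"):
            marker = "  <-- NaN" if math.isnan(value.simple_value) else ""
            print(event.step, value.tag, value.simple_value, marker)
```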
Hi, I tried to train the code again! This is my training script in 'train_stg1.sh': The 25th epoch: The 26th epoch: The 27th epoch:
The total_cost and wd_cost diverged at the 26th epoch. The evaluation result after reaching 40 epochs is as follows:
I also changed the learning rate from 1e-2 to 1e-3 and still met the same problem.
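One thing I can also try, to find out which step first produces the NaN instead of only seeing it in the moving average (just a sketch using TF's built-in numerics check, not something in the repo), is to wrap the loss tensor like this:

```python
# Sketch: make TF 1.x fail loudly as soon as a loss tensor contains NaN/Inf.
import tensorflow as tf

# Stand-in loss tensor; in the real code this would be the model's total_cost.
bad_values = tf.constant([0.5, float("nan")])
total_cost = tf.reduce_sum(bad_values, name="total_cost")

checked_cost = tf.check_numerics(total_cost, "total_cost contains NaN or Inf")

with tf.Session() as sess:
    try:
        sess.run(checked_cost)
    except tf.errors.InvalidArgumentError as e:
        print("Caught bad value:", e.message)
```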
Thanks. We haven't met this issue in our training, and it seems other users have not reported it either. I would suggest checking
- whether your TensorFlow version (1.14) meets our listed requirement (a quick check is sketched below)
- whether you followed the right data preparation.
No need to tune parameters to avoid this issue; the default parameters should work.
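A quick one-off check for the first point (this is not part of our scripts, just a convenience snippet):

```python
# One-off environment check before launching train_stg1.sh.
import sys
import tensorflow as tf

print("Python    :", sys.version.split()[0])   # 3.6.x was reported to work in this thread
print("TensorFlow:", tf.__version__)           # listed requirement: 1.14
print("GPU build :", tf.test.is_built_with_cuda())
```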
Hi, I have solved the problem. I was using the released branch 'dependabot/pip/tensorflow-gpu-1.15.2', and my TF version was 1.15.2. Now I use the master branch, downgraded the TF version to 1.14.0 and the Python version to 3.6.8, and both the NaN problem and the OOM problem are solved! It seems that the branch 'dependabot/pip/tensorflow-gpu-1.15.2' is not robust and still has some bugs.
Hi, when I test your code with train_stg1.sh to compute the teacher model, the logs show that the total_cost and wd_cost become NaN. I did not change any code. The data and GPUs are as follows: DATASET='coco_train2017.1@10' CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7