elv-xuwen opened this issue 4 years ago
Hi @elv-xuwen
NaN is a complicated issue; I've added some hints in this section: https://github.com/google/automl/blob/master/efficientdet/g3doc/faq.md#12-why-i-see-nan-during-my-training-and-how-to-debug-it
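For reference, here is a minimal sketch of one way to localize where a NaN first appears; this is a generic debugging approach, not something taken from the linked FAQ, and `total_loss` is a hypothetical tensor name rather than an identifier from this repo. Wrapping the loss in `tf.debugging.check_numerics` makes training fail fast with a named error instead of silently diverging:

```python
# A minimal sketch (assumption: TensorFlow graph or eager mode, loss tensor in hand).
import tensorflow as tf

def guard_loss(total_loss):
  # Raises InvalidArgumentError at the first step the loss becomes NaN/Inf,
  # which narrows the problem down to the loss itself rather than a later op.
  return tf.debugging.check_numerics(total_loss, message='total_loss is NaN/Inf')
```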
Hi @mingxingtan, thank you for your reply! I tried all the hints (except increasing the batch size), but none of them work. I can't increase my batch size due to memory constraints, so I can only use batch_size==4. What batch size and other hyperparameters did you use for D2? And do you have any plans to train on the Open Images Dataset?
Tan wrote in his paper: each model is trained for 300 epochs with a total batch size of 128 on 32 TPUv3 cores.
I tried all the hints with batch size 16, including reducing the LR, but I'm still getting the NaN error.
I trained D2 on COCO with batch_size==8 and learning_rate==0.08 and it worked well. But when I train D2 on a custom dataset with batch_size==4 and learning_rate==0.08, I get an error:
After I changed learning_rate to 0.001, the error no longer occurs, but the loss doesn't converge.
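For context, a common heuristic (not an official recommendation from this repo) is to scale the learning rate linearly with batch size; halving the batch size from 8 to 4 would then suggest roughly 0.04 rather than dropping all the way to 0.001. A rough sketch of that rule of thumb:

```python
# Linear-scaling heuristic: scale the learning rate in proportion to batch size.
# This is a generic rule of thumb, not a value taken from the EfficientDet paper.
def scaled_lr(base_lr, base_batch_size, batch_size):
  """Return base_lr scaled linearly with the new batch size."""
  return base_lr * batch_size / base_batch_size

# The COCO run above used learning_rate=0.08 at batch_size=8, so for batch_size=4:
print(scaled_lr(0.08, 8, 4))  # 0.04
```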