google / automl

Google Brain AutoML
Apache License 2.0

Training on a custom dataset does not converge #437

Open elv-xuwen opened 4 years ago

elv-xuwen commented 4 years ago

I trained D2 on COCO with batch_size==8 and learning_rate==0.08 and it worked well. But when I train D2 on a custom dataset with batch_size==4 and learning_rate==0.08, I get this error:

WARNING:tensorflow:From /home/elv-xuwen/object_detection/automl/efficientdet/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py:971: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating: Use standard file APIs to delete files with this prefix.
W0521 09:19:47.019784 140008692971328 deprecation.py:323] From /home/elv-xuwen/object_detection/automl/efficientdet/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py:971: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating: Use standard file APIs to delete files with this prefix.
ERROR:tensorflow:Model diverged with loss = NaN.
E0521 09:39:58.914585 140008692971328 basic_session_run_hooks.py:770] Model diverged with loss = NaN.
WARNING:tensorflow:Reraising captured error
W0521 09:39:59.938261 140008692971328 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "main.py", line 413, in <module>
    app.run(main)
  File "/home/elv-xuwen/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/elv-xuwen/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "main.py", line 380, in main
    steps=int(FLAGS.num_examples_per_epoch / FLAGS.train_batch_size))
  File "/home/elv-xuwen/object_detection/automl/efficientdet/venv/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 771, in after_run
    raise NanLossDuringTrainingError

After I changed learning_rate to 0.001, the error no longer occurs, but the loss does not converge. [training loss plot]
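For reference, a common rule of thumb is to scale the learning rate linearly with the total batch size rather than jumping straight to a much smaller value. Below is a minimal sketch that uses the working COCO run above (batch_size==8, learning_rate==0.08) as the reference point; the --hparams keys learning_rate and lr_warmup_init are assumptions based on efficientdet/hparams_config.py and may differ between versions:

```python
# Minimal sketch: linear learning-rate scaling with batch size.
# Reference point is the configuration that converged above
# (batch_size=8, learning_rate=0.08); everything else is an assumption.

REF_BATCH_SIZE = 8
REF_LR = 0.08

def scaled_lr(batch_size, ref_batch_size=REF_BATCH_SIZE, ref_lr=REF_LR):
    """Scale the base learning rate linearly with the total batch size."""
    return ref_lr * batch_size / ref_batch_size

if __name__ == "__main__":
    bs = 4
    lr = scaled_lr(bs)       # 0.04 for batch_size=4
    warmup = lr / 10.0       # warmup init is often ~1/10 of the base LR
    # The hparams keys below are assumptions taken from
    # efficientdet/hparams_config.py and may differ across versions.
    print(f'--train_batch_size={bs} '
          f'--hparams="learning_rate={lr},lr_warmup_init={warmup}"')
```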

mingxingtan commented 4 years ago

Hi @elv-xuwen

NaN is a complicated issue; I added some hints in this section: https://github.com/google/automl/blob/master/efficientdet/g3doc/faq.md#12-why-i-see-nan-during-my-training-and-how-to-debug-it
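In addition to the FAQ hints, a generic way to locate where the NaN first appears is TensorFlow's numeric checking. This is a minimal sketch, not wired into the efficientdet code; the cls_loss/box_loss names are only placeholders:

```python
import tensorflow as tf

# Option 1: fail fast, globally, as soon as any op produces NaN/Inf
# (available in TF >= 2.1; the error names the offending op).
tf.debugging.enable_check_numerics()

# Option 2: guard individual tensors, e.g. the loss terms, so the error
# message tells you which one blew up first.
def guarded_loss(cls_loss, box_loss):
    # cls_loss / box_loss are placeholder names for your own loss tensors.
    cls_loss = tf.debugging.check_numerics(cls_loss, "cls_loss is NaN/Inf")
    box_loss = tf.debugging.check_numerics(box_loss, "box_loss is NaN/Inf")
    return cls_loss + box_loss
```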

elv-xuwen commented 4 years ago

Hi @mingxingtan, thank you for your reply! I tried all the hints (except increasing the batch size), but none of them work. I can't increase the batch size because of memory limits, so I can only use batch_size==4. What batch size and other hyperparameters did you use for D2? And do you have any plans to train on the Open Images Dataset?
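If memory caps the per-step batch at 4, one generic workaround is gradient accumulation: run several small forward/backward passes and apply the summed gradients once, which approximates a larger effective batch. As far as I know main.py does not expose this, so the sketch below is a standalone TF2 toy example that only illustrates the pattern:

```python
import tensorflow as tf

# Toy model and optimizer; the point is the accumulation pattern,
# not EfficientDet itself.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.08, momentum=0.9)
loss_fn = tf.keras.losses.MeanSquaredError()

ACCUM_STEPS = 4  # e.g. 4 micro-batches of 4 ~ effective batch size 16

def accumulated_train_step(micro_batches):
    """micro_batches: list of (x, y) pairs, one per micro-batch."""
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for x, y in micro_batches:
        with tf.GradientTape() as tape:
            # Divide by ACCUM_STEPS so the summed gradients match the
            # average gradient of one big batch.
            loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
    optimizer.apply_gradients(zip(accum, model.trainable_variables))

# Example usage with random data:
batches = [(tf.random.normal([4, 8]), tf.random.normal([4, 1]))
           for _ in range(ACCUM_STEPS)]
accumulated_train_step(batches)
```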

rcg12387 commented 4 years ago

Tan wrote in his paper: each model is trained for 300 epochs with a total batch size of 128 on 32 TPUv3 cores.

Pari-singh commented 4 years ago

I tried all the hints with batch size 16, including reducing the LR, but I am still getting the NaN error.
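Since the hyperparameter hints alone do not seem to fix it, it may also be worth ruling out degenerate annotations in the custom dataset (zero-area boxes, coordinates outside the image, or unexpected category ids), which are a common cause of NaN losses. A rough sketch for a COCO-format JSON file; instances.json is a placeholder path and the category-id range check depends on your label map:

```python
import json

# Rough sanity check for a COCO-format annotation file. "instances.json"
# is a placeholder path; NUM_CLASSES should match your --hparams setting,
# and the category-id range check assumes ids run from 1 to NUM_CLASSES.
NUM_CLASSES = 90

with open("instances.json") as f:
    coco = json.load(f)

images = {img["id"]: img for img in coco["images"]}
bad = 0
for ann in coco["annotations"]:
    x, y, w, h = ann["bbox"]
    img = images[ann["image_id"]]
    if (w <= 0 or h <= 0
            or x < 0 or y < 0
            or x + w > img["width"] or y + h > img["height"]
            or not (1 <= ann["category_id"] <= NUM_CLASSES)):
        bad += 1
        print("suspicious annotation:", ann["id"], ann["bbox"], ann["category_id"])
print(f"{bad} suspicious annotations out of {len(coco['annotations'])}")
```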