google / automl

Google Brain AutoML
Apache License 2.0

NaN loss during training. #404

Open hongrui16 opened 4 years ago

hongrui16 commented 4 years ago

Use standard file utilities to get mtimes.
ERROR:tensorflow:Model diverged with loss = NaN.
E0513 16:02:25.483811 139845663950592 basic_session_run_hooks.py:760] Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
E0513 16:02:25.939540 139845663950592 error_handling.py:75] Error recorded from training_loop: NaN loss during training.
WARNING:tensorflow:Reraising captured error
W0513 16:02:25.939864 139845663950592 error_handling.py:135] Reraising captured error

I changed the following parameters: h.learning_rate = 0.08 => 0.001 and h.lr_warmup_init = 0.008 => 0.0001.

It did not work.

@fsx950223 I use the latest version of the master branch.
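
For reference, here is a hedged sketch of overriding these hparams programmatically instead of editing the defaults in hparams_config.py; the model name (efficientdet-d0) is only an example, and get_efficientdet_config is assumed to come from the repo's hparams_config module:

```python
# Sketch: override the learning-rate hparams without editing the defaults.
# Assumes efficientdet/hparams_config.py from this repo is importable.
import hparams_config

config = hparams_config.get_efficientdet_config("efficientdet-d0")  # example model
config.learning_rate = 0.001    # default is 0.08
config.lr_warmup_init = 0.0001  # default is 0.008
print(config.learning_rate, config.lr_warmup_init)
```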

fsx950223 commented 4 years ago

Estimator Dump Hook could help you.

hongrui16 commented 4 years ago

Thank you. @fsx950223, what did you mean by "Estimator Dump Hook could help you"? Could you explain more specifically? P.S. I use TF 1.15.

fsx950223 commented 4 years ago

Refer to the TF1 debugging (tfdbg) tutorial.
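
For example, a minimal, self-contained TF 1.15 sketch of attaching tfdbg's DumpingDebugHook to an Estimator, so every executed tensor is dumped to disk and you can trace where the NaNs first appear; the toy model, data, and dump directory below are placeholders, not code from this repo:

```python
# Minimal sketch (TF 1.15): dump executed tensors with tfdbg so the first NaN
# can be traced offline. The toy model, data, and dump path are placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.python import debug as tf_debug


def model_fn(features, labels, mode):
    # Tiny linear model, just enough to demonstrate the hook.
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


def input_fn():
    x = np.random.rand(32, 4).astype(np.float32)
    y = np.random.rand(32, 1).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).batch(8).repeat()


estimator = tf.estimator.Estimator(model_fn)
# Every session.run() gets its tensors dumped under this directory; the dumps
# can then be inspected offline with the tfdbg CLI.
dump_hook = tf_debug.DumpingDebugHook("/tmp/tfdbg_dumps")
estimator.train(input_fn, hooks=[dump_hook], max_steps=10)
```

Inspecting the dump from the first step where the loss or a gradient becomes inf/NaN usually points at the offending op.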

elv-xuwen commented 4 years ago

Hi, I met the same issue when training D2 on a custom dataset with batch_size == 4. I tried h.learning_rate = 0.08 => 0.001/0.01 and h.lr_warmup_init = 0.008 => 0.0001/0.001; it did not work. Have you solved it?

wenh06 commented 4 years ago

Check your bounding box annotations to see if there are zero (or even negative) area boxes, which might have been created by mistake.

Several months ago, I encountered this NaN loss error when I used the TensorFlow Object Detection API, and finally found that a few of my bounding boxes had zero area. This kind of error is hard to find using annotation tools like labelImg. My practice is to gather all XML annotations into CSV file(s) (or, equivalently, a DataFrame) and check with, for example, (df['area'].values > 0).all()
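
To make that concrete, here is a small sketch of that kind of check, assuming Pascal VOC style XML files in a hypothetical annotations/ directory:

```python
# Sketch: gather Pascal VOC XML annotations into a DataFrame and flag boxes
# with zero or negative area. The annotations/ path is a placeholder.
import glob
import xml.etree.ElementTree as ET

import pandas as pd

rows = []
for xml_path in glob.glob("annotations/*.xml"):  # placeholder path
    root = ET.parse(xml_path).getroot()
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        rows.append({
            "file": xml_path,
            "label": obj.find("name").text,
            "area": (xmax - xmin) * (ymax - ymin),
        })

df = pd.DataFrame(rows)
print((df["area"].values > 0).all())   # should be True
print(df[df["area"] <= 0])             # list the offending boxes, if any
```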

landskris commented 3 years ago

Depending on your GPU, try setting mixed_precision: false in the config YAML file. This did it for me on Google Colab, whose GPUs do not seem to have compute capability > 7.0 and therefore normally do not benefit computationally from mixed precision. More info: https://www.tensorflow.org/guide/mixed_precision?hl=en
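
Since mixed float16 only pays off on GPUs with compute capability 7.0 or higher, one way to check before flipping that flag is the sketch below, assuming TF 2.3+ where get_device_details is available:

```python
# Sketch (TF 2.3+): print each GPU's compute capability; mixed float16 only
# helps on compute capability 7.0+ hardware (e.g. T4, V100, A100).
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(gpu)
    cc = details.get("compute_capability")  # e.g. (7, 5) on a T4
    print(gpu.name, details.get("device_name"), cc)
    if cc and cc >= (7, 0):
        print("  mixed precision should help here")
    else:
        print("  consider mixed_precision: false in the config")
```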