tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

Mask RCNN - NAN values in tensor #376

Open lezardvt opened 5 years ago

lezardvt commented 5 years ago

Hi,

I am training Mask RCNN on my own dataset (about 2000 samples). The model trains and I get the expected results after training. The problem I am having is that after X steps I get the following error: Gradient for resnet50/batch_normalization_35/beta:0 is NaN : Tensor had NaN values

This is my config:

    checkpoint: gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
    num_classes: 32
    init_learning_rate: 0.08
    backbone: 'resnet50'
    use_bfloat16: True
    train_batch_size: 8
    eval_batch_size: 8
    training_file_pattern: gs://my/path/TFRecords/train-
    validation_file_pattern: gs://my/path/TFRecords/val-
    val_json_file: gs://my/path/val_annotations.json
    total_steps: 5000
    num_steps_per_eval: 60
    eval_samples: 116

I use 64 shards when generating TFRecords.

I have played with the learning rate and batch normalization settings: batch normalization seems to delay the error a bit, and so do lower learning rates. I have also tried other backbone networks and I get the same result.

Any suggestions?

sayradley commented 5 years ago

Have you managed to solve the issue? I'm facing the same problem.

lezardvt commented 5 years ago

I have not solved it completely, but there are several workarounds:

First, double-check your training data. If there are errors in your data you will get tensors with NaN values.
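For example, a quick scan over one TFRecord shard can catch NaN or out-of-range boxes before they ever reach the model. This is only a sketch: the feature keys below follow the common COCO-style layout, the check assumes boxes were written normalized to [0, 1], and the shard name is made up, so adjust all of these to match whatever your conversion script actually wrote.

    import math
    import tensorflow as tf  # TF 1.x, matching the era of this repo

    # Illustrative COCO-style keys; replace with the keys your converter uses.
    BBOX_KEYS = ['image/object/bbox/xmin', 'image/object/bbox/ymin',
                 'image/object/bbox/xmax', 'image/object/bbox/ymax']

    def count_bad_examples(path):
        """Counts examples with NaN or out-of-range box coordinates."""
        bad = 0
        for record in tf.python_io.tf_record_iterator(path):
            example = tf.train.Example()
            example.ParseFromString(record)
            feats = example.features.feature
            for key in BBOX_KEYS:
                values = feats[key].float_list.value if key in feats else []
                # Assumes boxes were stored normalized to [0, 1].
                if any(math.isnan(v) or v < 0.0 or v > 1.0 for v in values):
                    bad += 1
                    break
        return bad

    # Hypothetical shard name; point this at one of your 64 shards.
    print(count_bad_examples('train-00000-of-00064'))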

If you are sure that your data is OK, try these options:

Option 1: Start with a very low learning rate; in my case I start with 0.005. You will have to find this value through trial and error: train, say, 100 steps at a time and decrease the learning rate until training becomes stable. Once the model reaches a certain accuracy the NaNs may appear again; decrease the learning rate again until training stabilizes. Once you have a feel for the learning rates you need, you can set up a training schedule in your config.
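For illustration, a stepped schedule in the config could look something like the following. I am quoting the schedule keys from memory (warmup_learning_rate, warmup_steps, learning_rate_levels, learning_rate_steps), so check them against the Mask RCNN params file in your version of the repo before relying on them; the values themselves are just an example for a 5000-step run.

    init_learning_rate: 0.005
    warmup_learning_rate: 0.0005
    warmup_steps: 500
    learning_rate_levels: [0.0005, 0.00005]
    learning_rate_steps: [3000, 4500]

The idea is to warm up at a small rate, train most of the steps at 0.005, and then drop the rate twice before the end of training.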

Option 2: If you are using your own data, try starting training without a checkpoint.

Option 3: Experiment with gradient clipping.
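As a rough sketch of what gradient clipping can look like in TF 1.x (this is not the repo's own wiring; on a TPU the optimizer is normally wrapped in a CrossShardOptimizer first, and the clip_norm of 10.0 is just a starting point to tune):

    import tensorflow as tf  # TF 1.x style

    # Hypothetical placeholders: `loss` and `learning_rate` come from your model_fn.
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    grads, variables = zip(*optimizer.compute_gradients(loss))
    # Rescale gradients so their global norm never exceeds 10.0.
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm=10.0)
    train_op = optimizer.apply_gradients(
        zip(clipped, variables), global_step=tf.train.get_global_step())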

Hope this helps. If you find another solution, please post it here.