tensorflow / models

Models and examples built with TensorFlow

LossTensor is inf or nan while training ssd_mobilenet_v1_coco model in my own dataset #3688

Closed Bidski closed 4 years ago

Bidski commented 6 years ago

I am having issues similar to #1881 and #1907.

Using the Google object_detection API and the latest TensorFlow master repo, built with CUDA 9.1 on Linux Mint 18.2 (based on Ubuntu Xenial).

Have I written custom code: No, but custom dataset
OS Platform and Distribution: Linux Mint 18.2 (based on Ubuntu 16.04)
TensorFlow installed from: built and installed from GitHub master
TensorFlow version: 1.8.0-rc0-cp35-cp35m-linux_x86_64
Bazel version: 0.12.0
CUDA/cuDNN version: CUDA 9.1, cuDNN 7.1
GPU model and memory: GTX 1080 Ti, 11GB
Exact command to reproduce: cd tensorflow/models/research && python3 object_detection/train.py --logtostderr --pipeline_config_path=/path/to/pipeline_config.pbtxt --train_dir=/path/to/train/folder

Describe the problem

I am trying to fine-tune the ssd_mobilenet_v1_coco model on my own dataset, using the default config file provided in the object detection repository.

From the very first global step of training I receive the "LossTensor is inf or nan. : Tensor had NaN values" error.

As far as I can tell, I have already tried everything that was suggested in #1881 and #1907; none of it has worked for me.

Source code / logs

INFO:tensorflow:Restoring parameters from /media/bidski/Portable/imagetagger/tf_objapi/models/ssd_mobilenet_v1_coco_2017_11_17/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /media/bidski/Portable/imagetagger/tf_objapi/models/ssd_mobilenet/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 31.4505 (12.442 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

Caused by op 'CheckNumerics', defined at:
  File "object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/bidski/Projects/models/research/object_detection/trainer.py", line 288, in train
    total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 734, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3303, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1669, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1313, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1421, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/bidski/Projects/models/research/object_detection/trainer.py", line 360, in train
    saver=saver)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 906, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1141, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1341, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]

Caused by op 'CheckNumerics', defined at:
  File "object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/home/bidski/Projects/models/research/object_detection/trainer.py", line 288, in train
    total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 734, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3303, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1669, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](AddN/_4851)]]
tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

Bidski commented 6 years ago

Updated to make requested details more obvious

hustc12 commented 6 years ago

I've hit this issue too, although I used faster_rcnn_resnet101_pets to train my own dataset. I heard from someone that it might be caused by very small pictures in the training set, such as 15x30 pixels. I will try removing those samples from my dataset and training again. If there is any update, I will post it here.

UPDATE: After investigating, I found that small samples are not the root cause of the crash. (I even tried training with very small samples, down to 5x5 pixels, and at least for the first 200 steps there was no crash.) What I actually found was that some coordinates in the annotation file were in the wrong order. The annotations mark the box coordinates as x1, y1, x2 and y2, where x1 should be less than x2 and y1 less than y2. However, in my case some annotated samples had x1 > x2 or y1 > y2, which caused the crash. After I corrected the order of the coordinates, the crash was gone. Hope this information can help someone.
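In case it helps anyone check for this, here is a minimal sketch (assuming Pascal VOC-style XML annotations with xmin/ymin/xmax/ymax tags; the annotations/ directory and tag names are illustrative, so adjust them to your dataset) that prints every box whose coordinates are swapped or degenerate:

# Rough sketch: flag annotation boxes where xmin >= xmax or ymin >= ymax.
# Assumes Pascal VOC-style XML files; adapt the paths/tag names to your dataset.
import glob
import xml.etree.ElementTree as ET

for path in glob.glob('annotations/*.xml'):  # hypothetical location
    root = ET.parse(path).getroot()
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        xmin = float(box.find('xmin').text)
        ymin = float(box.find('ymin').text)
        xmax = float(box.find('xmax').text)
        ymax = float(box.find('ymax').text)
        if xmin >= xmax or ymin >= ymax:
            print('Bad box in {}: {}'.format(path, (xmin, ymin, xmax, ymax)))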

Bidski commented 6 years ago

I just re-checked my dataset. I have no entries where x1 > x2 or y1 > y2. However, this error still persists for me.

hustc12 commented 6 years ago

@Bidski, when did you encounter this crash: at the very beginning of training, or after it had been running for a while? I suspect that whenever this crash happens, it is somehow related to the dataset (maybe the wrong order of the coordinates, or maybe invalid coordinate values). You could split the dataset into pieces and train on each piece to see which part of the data triggers the crash again (a rough sketch of one way to do that is below). That way you can at least narrow down the crash-related data and might find the real cause of your crash.
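For what it's worth, a minimal sketch of that bisection idea (assuming TF 1.x and a single train.record input; the file names and shard count are only illustrative) splits the training TFRecord into shards so the training job can be pointed at one shard at a time:

# Rough sketch: split a training TFRecord into shards so each shard can be
# trained on separately, to narrow down which records trigger the NaN loss.
# Assumes TF 1.x; 'train.record' and the shard count are illustrative.
import tensorflow as tf

INPUT_RECORD = 'train.record'
NUM_SHARDS = 4

writers = [tf.python_io.TFRecordWriter('train_shard_{}.record'.format(i))
           for i in range(NUM_SHARDS)]
for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(INPUT_RECORD)):
    # Round-robin the serialized examples across the shards.
    writers[i % NUM_SHARDS].write(serialized_example)
for w in writers:
    w.close()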

myuanz commented 6 years ago

I have the same problem, and I tried

ssd_mobilenet_v1_coco
ssd_mobilenet_v1_ppn
ssd_mobilenet_v2_coco
ssdlite_mobilenet_v2_coco

None of them works, but if I use faster_rcnn_inception_v2_coco, it works well.

xtianhb commented 6 years ago

I have a similar problem. It may be related to #4881.

mawanda-jun commented 6 years ago

Exact same problem here, trying @myuanz's set of networks with TensorFlow r1.8 compiled with GPU support on Windows 10 (64-bit).

carlosfab commented 6 years ago

I had this problem. It was solved when I checked the following (see the sketch after this list):

  1. (xmin, xmax) < width ; (ymin, ymax) < height
  2. (xmin < xmax) ; (ymin < ymax)
  3. if object_area <= (width * height) / (16 * 16): raise Exception('object too small Error')
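Put together as code, those checks look roughly like the sketch below (assuming pixel-space xmin/ymin/xmax/ymax values and a known image width/height; the function name and the 16x16 minimum-area threshold just mirror point 3 and are illustrative):

# Rough sketch of the three checks above; naming and thresholds are illustrative.
def validate_box(xmin, ymin, xmax, ymax, width, height):
    # 1. The box must lie inside the image.
    if not (0 <= xmin < width and 0 < xmax <= width and
            0 <= ymin < height and 0 < ymax <= height):
        raise ValueError('box outside image bounds')
    # 2. The coordinates must be correctly ordered.
    if xmin >= xmax or ymin >= ymax:
        raise ValueError('xmin/xmax or ymin/ymax out of order')
    # 3. Reject tiny objects that can destabilise training.
    if (xmax - xmin) * (ymax - ymin) <= (width * height) / (16 * 16):
        raise ValueError('object too small')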
tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

savhascelik commented 3 years ago

@carlosfab you are absolutely right. You saved the day. Thank you