maacofficial / tensorflow-train-object-detection-classiffier-tutorial

In this tutorial, I am going to show you how to train your own TensorFlow object detection classifier on Windows 10.
Apache License 2.0

Error while running train.py #3

Closed poyrazaktas closed 4 years ago

poyrazaktas commented 4 years ago

Hello. When I run the program, I get an error at the final stage that I have not been able to solve. I am running this command:

python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. 
WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\trainer.py:210: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.create_global_step INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:depth of additional conv before box predictor: 0 WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\core\box_predictor.py:380: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. Instructions for updating: keep_dims is deprecated, use keepdims instead WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\core\losses.py:339: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version. Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1670: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead. C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " WARNING:tensorflow:From C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\clip_ops.py:110: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. Instructions for updating: keep_dims is deprecated, use keepdims instead WARNING:tensorflow:From C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py:736: Supervisor.init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-05-11 23:35:43.547184: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 INFO:tensorflow:Restoring parameters from C:/tensorflow1/models/research/object_detection/faster_rcnn_inception_v2_coco_2018_01_28/model.ckpt INFO:tensorflow:Starting Session. INFO:tensorflow:Saving checkpoint to path training/model.ckpt INFO:tensorflow:Starting Queues. INFO:tensorflow:global_step/sec: 0 INFO:tensorflow:Recording summary at step 0. 
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at: File "train.py", line 163, in tf.app.run() File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 263, in train total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.') File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 704, in check_numerics "CheckNumerics", tensor=tensor, message=message, name=name) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op op_def=op_def) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Traceback (most recent call last): File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call return fn(*args) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn status, run_metadata) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in exit c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "train.py", line 163, in tf.app.run() File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 332, in train saver=saver) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 767, in train sess, train_op, global_step, train_step_kwargs) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step run_metadata=run_metadata) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 895, in run run_metadata_ptr) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run feed_dict_tensor, options, run_metadata) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run options, run_metadata) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at: File "train.py", line 163, in tf.app.run() File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 263, in train total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.') File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 704, in check_numerics "CheckNumerics", tensor=tensor, message=message, name=name) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op op_def=op_def) File "C:\Users\Poyraz.conda\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Sorry for posting such a long output. I suspect the problem is in the part below, but since I could not be sure, I posted the whole output.

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

A sequence of steps like this has been suggested as a fix for this error, but I did not understand how to add it to the code. I would be very grateful for your help.

maacofficial commented 4 years ago

You may have downloaded the wrong dataset, and this error could be caused by that. Are you sure you downloaded the faster_rcnn_inception_v2_coco dataset and put it inside the object_detection folder? Finally, after extracting the dataset into the "object_detection" folder, make sure there is a folder named "faster_rcnn_inception_v2_coco_2018_01_28".

poyrazaktas commented 4 years ago

Yes, I'm sure.

maacofficial commented 4 years ago

Yes, I'm sure.

Make sure there is a folder named "faster_rcnn_inception_v2_coco_2018_01_28".

poyrazaktas commented 4 years ago

Yes, I am sure that such a folder exists as well.

maacofficial commented 4 years ago

Could you check that line 110 of C:\tensorflow1\models\research\object_detection\training\faster_rcnn_inception_v2_pets.config, fine_tune_checkpoint: "C:/tensorflow1/models/research/object_detection/faster_rcnn_inception_v2_coco_2018_01_28/model.ckpt", has been edited correctly and saved?
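The check requested above can also be scripted. The following is a minimal sketch, assuming the config keeps the fine_tune_checkpoint value in double quotes on a single line; the function names are hypothetical and are not part of the tutorial or of TensorFlow. It relies on the fact that a TensorFlow 1.x checkpoint prefix such as model.ckpt is not a file itself but is stored as model.ckpt.index plus model.ckpt.data-* files.

```python
import os
import re


def find_fine_tune_checkpoint(config_text):
    """Return the fine_tune_checkpoint value from pipeline config text, or None."""
    match = re.search(r'fine_tune_checkpoint\s*:\s*"([^"]*)"', config_text)
    return match.group(1) if match else None


def checkpoint_files_exist(checkpoint_prefix):
    """Check that <prefix>.index and at least one <prefix>.data-* file exist."""
    folder = os.path.dirname(checkpoint_prefix) or "."
    name = os.path.basename(checkpoint_prefix)
    if not os.path.isdir(folder):
        return False
    files = os.listdir(folder)
    has_index = (name + ".index") in files
    has_data = any(f.startswith(name + ".data-") for f in files)
    return has_index and has_data


def report(config_path):
    """Print the configured checkpoint prefix and whether its files are present."""
    with open(config_path, encoding="utf-8") as f:
        prefix = find_fine_tune_checkpoint(f.read())
    found = prefix is not None and checkpoint_files_exist(prefix)
    print("fine_tune_checkpoint:", prefix)
    print("checkpoint files found:", found)
    return found
```

Calling report() with the path of faster_rcnn_inception_v2_pets.config prints the configured prefix and whether the matching files are on disk.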

poyrazaktas commented 4 years ago

Yes, it is identical to the file path you wrote. I did not make any other changes.

maacofficial commented 4 years ago

Do you have a discrete graphics card? If so, which model is it?

maacofficial commented 4 years ago

Are you using the images I shared in the project, or did you take and add your own pictures? Are you using your own label maps?

poyrazaktas commented 4 years ago

I am using a laptop. It has a GTX 1050 graphics card with 4 GB of GDDR5.

maacofficial commented 4 years ago

I am using a laptop. It has a GTX 1050 graphics card with 4 GB of GDDR5.

OK, then I recommend running the training on the graphics card. Running the commands pip uninstall tensorflow and pip install tensorflow-gpu==1.5.0 in that order should be enough. Your graphics card is more than adequate for this job.

poyrazaktas commented 4 years ago

I am using my own dataset. I should also mention that I put different files in the test folder and it ran for one step.

poyrazaktas commented 4 years ago

I am trying that right now. I think CUDA and cuDNN need to be installed. Should I try after installing them, or can it be used directly?

maacofficial commented 4 years ago

Yes, you need to install CUDA and cuDNN. You can check the compatible versions here: https://www.tensorflow.org/install/source#gpu

maacofficial commented 4 years ago

I am using my own dataset. I should also mention that I put different files in the test folder and it ran for one step.

I assume you mean your own images. If you are using your own images, make sure each photo is smaller than 200 KB and its resolution is lower than 720x1280.
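The size and resolution limits mentioned above can be checked for a whole folder at once. This is a minimal sketch using only the standard library. It assumes that each image has a LabelImg (Pascal VOC) XML file with the same base name in the same folder and that the width and height recorded in that XML are accurate; the function name and the constants are hypothetical.

```python
import os
import xml.etree.ElementTree as ET

MAX_BYTES = 200 * 1024   # 200 KB limit mentioned above
MAX_SIDES = (720, 1280)  # resolution limit mentioned above, in either orientation


def oversized_images(image_dir):
    """Return (file name, reason) pairs for images at or over the limits."""
    problems = []
    for name in sorted(os.listdir(image_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() not in (".jpg", ".jpeg", ".png"):
            continue
        # File size is read from disk.
        size = os.path.getsize(os.path.join(image_dir, name))
        if size >= MAX_BYTES:
            problems.append((name, "file is %d bytes" % size))
        # Resolution is read from the annotation file, if there is one.
        xml_path = os.path.join(image_dir, stem + ".xml")
        if os.path.exists(xml_path):
            node = ET.parse(xml_path).getroot().find("size")
            if node is not None:
                w = int(node.findtext("width"))
                h = int(node.findtext("height"))
                if min(w, h) >= min(MAX_SIDES) or max(w, h) >= max(MAX_SIDES):
                    problems.append((name, "resolution is %dx%d" % (w, h)))
    return problems
```

An empty list means every image in the folder is under both limits, as far as the annotation files can tell.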

poyrazaktas commented 4 years ago

I got the same error again. All of my images have the same resolution, 480x640, and the largest image file is 66 KB. I think the workaround I used in the issue I mentioned earlier introduced errors when I generated the TFRecord files. The error I am getting could be a result of changing that index to 5, but I have not been able to find a fix.
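One cause often reported for "LossTensor is inf or nan" is an annotation whose bounding box is empty, inverted, or outside the image. The sketch below reads LabelImg (Pascal VOC) XML files by tag name rather than by position, so an extra element inside an object does not shift the fields, and then lists suspicious boxes. It assumes the index mentioned above refers to positional access into the XML object element, which cannot be confirmed because the earlier issue is not shown here. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_boxes(xml_path):
    """Return (width, height, boxes); each box is (class, xmin, ymin, xmax, ymax)."""
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.findall("object"):
        bnd = obj.find("bndbox")  # looked up by name, not by position
        boxes.append((
            obj.findtext("name"),
            int(bnd.findtext("xmin")),
            int(bnd.findtext("ymin")),
            int(bnd.findtext("xmax")),
            int(bnd.findtext("ymax")),
        ))
    return width, height, boxes


def invalid_boxes(width, height, boxes):
    """Return the boxes that are empty, inverted, or outside the image."""
    bad = []
    for label, xmin, ymin, xmax, ymax in boxes:
        inside = 0 <= xmin and 0 <= ymin and xmax <= width and ymax <= height
        if not inside or xmin >= xmax or ymin >= ymax:
            bad.append((label, xmin, ymin, xmax, ymax))
    return bad
```

Running read_boxes() and invalid_boxes() over every XML file before regenerating the TFRecord files shows whether any annotation needs to be fixed first.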

maacofficial commented 4 years ago

Could you share the error output? Is there no difference?

poyrazaktas commented 4 years ago

Here is the error output you asked for. It completed only one step, and I could not see any other difference:

(tensorflow1) C:\tensorflow1\models\research\object_detection>python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. 
WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\trainer.py:210: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.create_global_step INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:depth of additional conv before box predictor: 0 WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\core\box_predictor.py:380: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. Instructions for updating: keep_dims is deprecated, use keepdims instead WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\core\losses.py:339: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version. Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1670: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead. C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " WARNING:tensorflow:From C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\clip_ops.py:110: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. Instructions for updating: keep_dims is deprecated, use keepdims instead WARNING:tensorflow:From C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py:736: Supervisor.init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-05-12 02:02:41.937474: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 INFO:tensorflow:Restoring parameters from training/model.ckpt-0 INFO:tensorflow:Starting Session. INFO:tensorflow:Saving checkpoint to path training/model.ckpt INFO:tensorflow:Starting Queues. INFO:tensorflow:global_step/sec: 0 INFO:tensorflow:Recording summary at step 0. 
INFO:tensorflow:global step 1: loss = 5.2393 (7.884 sec/step) INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at: File "train.py", line 163, in tf.app.run() File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 263, in train total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.') File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 704, in check_numerics "CheckNumerics", tensor=tensor, message=message, name=name) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op op_def=op_def) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Traceback (most recent call last): File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call return fn(*args) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn status, run_metadata) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in exit c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "train.py", line 163, in tf.app.run() File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 332, in train saver=saver) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 767, in train sess, train_op, global_step, train_step_kwargs) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step run_metadata=run_metadata) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 895, in run run_metadata_ptr) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run feed_dict_tensor, options, run_metadata) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run options, run_metadata) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at: File "train.py", line 163, in tf.app.run() File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 263, in train total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.') File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 704, in check_numerics "CheckNumerics", tensor=tensor, message=message, name=name) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op op_def=op_def) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

maacofficial commented 4 years ago

Could you try setting the batch size value on line 85 of object_detection/training/faster_rcnn_inception_v2_pets.config to 16? Then try training again.

maacofficial commented 4 years ago

Could you try setting the batch size value on line 85 of object_detection/training/faster_rcnn_inception_v2_pets.config to 16? Then try training again.

If that does not work either, the error can sometimes be caused by the object in your own image being very small relative to the image, for example 30x35 pixels, but that is unlikely. You did label all of your images with LabelImg, right?
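Both questions above (very small objects and unlabeled images) can be answered with a short script. This is a minimal sketch, assuming LabelImg saved one Pascal VOC XML file next to each image with the same base name. The 35-pixel default threshold is a hypothetical value taken from the 30x35 example above, and the function name is also hypothetical.

```python
import os
import xml.etree.ElementTree as ET

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def labeling_report(image_dir, min_side=35):
    """Return (images without an XML file, small boxes as (xml, class, w, h))."""
    names = sorted(os.listdir(image_dir))
    xml_stems = {os.path.splitext(n)[0] for n in names if n.lower().endswith(".xml")}
    # Images that have no annotation file with the same base name.
    unlabeled = [
        n for n in names
        if os.path.splitext(n)[1].lower() in IMAGE_EXTENSIONS
        and os.path.splitext(n)[0] not in xml_stems
    ]
    # Boxes whose shorter side is at or below the threshold.
    small = []
    for n in names:
        if not n.lower().endswith(".xml"):
            continue
        root = ET.parse(os.path.join(image_dir, n)).getroot()
        for obj in root.findall("object"):
            bnd = obj.find("bndbox")
            w = int(bnd.findtext("xmax")) - int(bnd.findtext("xmin"))
            h = int(bnd.findtext("ymax")) - int(bnd.findtext("ymin"))
            if min(w, h) <= min_side:
                small.append((n, obj.findtext("name"), w, h))
    return unlabeled, small
```

Running it once on the train folder and once on the test folder gives a quick list of images to label again or to leave out.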

poyrazaktas commented 4 years ago

(tensorflow1) C:\tensorflow1\models\research\object_detection>python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config

This time it waited for a long time at the INFO:tensorflow stage, but unfortunately it gave the same error again:

C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:493: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:494: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:495: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:496: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:497: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\dtypes.py:502: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. 
WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\trainer.py:210: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.create_global_step INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:Scale of 0 disables regularizer. INFO:tensorflow:depth of additional conv before box predictor: 0 WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\core\box_predictor.py:380: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. Instructions for updating: keep_dims is deprecated, use keepdims instead WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\core\losses.py:339: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version. Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From C:\tensorflow1\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py:1670: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead. C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " WARNING:tensorflow:From C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\clip_ops.py:110: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. Instructions for updating: keep_dims is deprecated, use keepdims instead WARNING:tensorflow:From C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py:736: Supervisor.init (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-05-12 02:16:57.923856: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 INFO:tensorflow:Restoring parameters from training/model.ckpt-0 INFO:tensorflow:Starting Session. INFO:tensorflow:Saving checkpoint to path training/model.ckpt INFO:tensorflow:Starting Queues. INFO:tensorflow:global_step/sec: 0 INFO:tensorflow:Recording summary at step 0. 
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at: File "train.py", line 163, in tf.app.run() File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run _sys.exit(main(argv)) File "train.py", line 159, in main worker_job_name, is_chief, FLAGS.train_dir) File "C:\tensorflow1\models\research\object_detection\trainer.py", line 263, in train total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.') File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 704, in check_numerics "CheckNumerics", tensor=tensor, message=message, name=name) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op op_def=op_def) File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_call
    return fn(*args)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1329, in _run_fn
    status, run_metadata)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    tf.app.run()
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "C:\tensorflow1\models\research\object_detection\trainer.py", line 332, in train
    saver=saver)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 767, in train
    sess, train_op, global_step, train_step_kwargs)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1344, in _do_run
    options, run_metadata)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\client\session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at:
  File "train.py", line 163, in <module>
    tf.app.run()
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 124, in run
    _sys.exit(main(argv))
  File "train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "C:\tensorflow1\models\research\object_detection\trainer.py", line 263, in train
    total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 704, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 3160, in create_op
    op_def=op_def)
  File "C:\ProgramData\Anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\framework\ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[Node: CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

poyrazaktas commented 4 years ago

> Could you try setting the batch size value on line 85 of object_detection/training/faster_rcnn_inception_v2_pets.config to 16? Then try training again.
>
> If that does not help either, this can sometimes happen because the object in one of your own images is very small relative to the image, for example 30x35 pixels, although that is unlikely. You did label all of your images with LabelImg, right?

Yes, I labeled them, and some of my labels are very small. By my estimate, some are smaller than 30x35 pixels. For example, the pixel coordinates of one object in an XML file look like this:

<object>
  <name>six</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <occluded>0</occluded>
  <bndbox>
    <xmin>196</xmin>
    <xmax>209</xmax>
    <ymin>209</ymin>
    <ymax>234</ymax>
  </bndbox>
</object>
</annotation>

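To make the size of that box concrete, here is a minimal Python sketch that parses the `<object>` element shown above and computes its width and height. The helper names `box_size` and `is_small` are hypothetical, and the 30x35 threshold is only the rough figure mentioned in this thread, not a limit documented by TensorFlow.

```python
import xml.etree.ElementTree as ET

# Assumption: thresholds taken from the "30x35 pixels" figure in this thread.
MIN_WIDTH, MIN_HEIGHT = 30, 35


def box_size(obj):
    """Return (width, height) in pixels for one <object> element."""
    box = obj.find("bndbox")
    xmin, xmax = int(box.findtext("xmin")), int(box.findtext("xmax"))
    ymin, ymax = int(box.findtext("ymin")), int(box.findtext("ymax"))
    return xmax - xmin, ymax - ymin


def is_small(obj, min_width=MIN_WIDTH, min_height=MIN_HEIGHT):
    """True if the box is narrower or shorter than the given thresholds."""
    width, height = box_size(obj)
    return width < min_width or height < min_height


# The <object> element posted above, copied verbatim.
SAMPLE = """
<object>
  <name>six</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <occluded>0</occluded>
  <bndbox>
    <xmin>196</xmin>
    <xmax>209</xmax>
    <ymin>209</ymin>
    <ymax>234</ymax>
  </bndbox>
</object>
"""

sample_obj = ET.fromstring(SAMPLE)
width, height = box_size(sample_obj)
print(width, height, is_small(sample_obj))  # 13 25 True
```

The box above is 13 pixels wide and 25 pixels tall, so it falls below the 30x35 figure in both dimensions.
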
maacofficial commented 4 years ago

Restore the batch size to its previous value. Then, as a test, move the images with small labels, together with their XML files, out of the folder to some other location and try again. Don't forget to redo the xml-to-csv and train record generation steps.
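The "set the small-label images aside" step could be sketched in Python roughly as follows. This is an illustration only: the function names, the 30x35 defaults, and the list of image extensions are assumptions, and it presumes LabelImg-style XML files that sit next to images with the same file name stem.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: images share the XML file's stem and use one of these extensions.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def has_small_box(xml_path, min_width=30, min_height=35):
    """True if any <bndbox> in the annotation file is below the given size."""
    root = ET.parse(str(xml_path)).getroot()
    for box in root.iter("bndbox"):
        width = int(box.findtext("xmax")) - int(box.findtext("xmin"))
        height = int(box.findtext("ymax")) - int(box.findtext("ymin"))
        if width < min_width or height < min_height:
            return True
    return False


def set_aside_small(image_dir, holdout_dir, min_width=30, min_height=35):
    """Move each XML with a small box, plus its image, into holdout_dir.

    Returns the names of the files that were moved.
    """
    image_dir, holdout_dir = Path(image_dir), Path(holdout_dir)
    holdout_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for xml_path in sorted(image_dir.glob("*.xml")):
        if not has_small_box(xml_path, min_width, min_height):
            continue
        # Move the annotation and any image file with the same stem.
        for ext in (".xml",) + IMAGE_EXTENSIONS:
            candidate = xml_path.with_suffix(ext)
            if candidate.exists():
                shutil.move(str(candidate), str(holdout_dir / candidate.name))
                moved.append(candidate.name)
    return moved
```

A call such as `set_aside_small("images/train", "images/held_out")` (both paths hypothetical) would need to be repeated for each image folder before regenerating the CSV and record files.
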

maacofficial commented 4 years ago

Yes, that really is small, and it could be the cause. Objects that are small within the image will not gain you anything. What matters is having images in a variety of forms and against a variety of backgrounds. I hope your problem gets solved. Good night.

poyrazaktas commented 4 years ago

From what I have checked, it looks like most of my images are labeled this way, so apparently I will be going back to the labeling work. Thank you for your efforts, and good night.

maacofficial commented 4 years ago

Your images can have a somewhat higher resolution; that will not cause a problem. You are welcome, and good night again.