Closed alexmagsam closed 6 years ago
Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update it if it is relevant in your case, or leave it as N/A? Thanks. Bazel version
The program terminates without throwing any errors or system messages? Since it seems to end while filling up the shuffle buffer, perhaps try lowering the shuffle_buffer_size value.
Yes, the output I posted was copied and pasted. Where can I set the shuffle_buffer_size? Can you refer me to the documentation or the file where it is set? This setting is not listed in my configuration file.
The configuration instructions are here: https://github.com/tensorflow/models/blob/b9ca525f88cd942882ca541ec5ac9d27bb87a24f/research/object_detection/g3doc/configuring_jobs.md
Taking a look at the InputReader proto, there's an optional field called "shuffle_buffer_size". So, for example, you can set the train_input_reader field in the config file like this:
train_input_reader: {
  shuffle_buffer_size: 1024
}
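For context, a complete train_input_reader block normally also points at the TFRecord file and label map. The sketch below is illustrative only: the paths are borrowed from the file structure listed later in this thread, and 1024 is just an example value, not a recommendation.
train_input_reader: {
  tf_record_input_reader {
    input_path: "object_detection/data/train.record"  # illustrative path from this issue
  }
  label_map_path: "object_detection/data/spheroid_label_map.pbtxt"  # illustrative path from this issue
  shuffle_buffer_size: 1024  # smaller values reduce the memory used while filling the buffer
}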
If you have further questions about configuring the pipeline, Stack Overflow would be the best place to ask. If you are still encountering an issue that you believe is a bug, or you have a feature request, feel free to open another issue.
Hi @alexmagsam, did you solve your problem with "Filling up shuffle buffer"?
Partially, yes. I added shuffle: false to the train_input_reader field like so:
train_input_reader: {
  shuffle: false
}
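An alternative to disabling shuffling entirely would be to follow the earlier suggestion in this thread and keep shuffle enabled with a smaller buffer, so the buffer fills without exhausting memory. A minimal sketch, with 256 as a purely illustrative value:
train_input_reader: {
  shuffle: true
  shuffle_buffer_size: 256  # example value; lower means less memory but weaker shuffling
}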
But now I receive a different error after training begins.
File "model_main.py", line 101, in <module>
tf.app.run()
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "model_main.py", line 97, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
return executor.run()
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
return self.run_local()
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\training.py", line 669, in run_local
hooks=train_hooks)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1135, in _train_model_default
saving_listeners)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "C:\Users\Alexm\my-venv\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1209, in run
run_metadata=run_metadata))
File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
Hello @alexmagsam, did you ever find a solution to your problem? I am currently experiencing the same thing.
System information
Issue
The training session fails to begin. The process is terminated without throwing any helpful errors. I have opened TensorBoard in the training/ directory, but there is no training going on. This information can also be found at https://stackoverflow.com/questions/51754386/tensorflow-object-detection-training-issue, but I feel this issue is better suited for this forum.
Background info
File structure
object_detection/data/train.record
object_detection/data/eval.record
object_detection/data/spheroid_label_map.pbtxt
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt.data-00000-of-00001
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt.index
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt.meta
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/faster_rcnn_resnet101_coco.config
object_detection/training/
Output
Config file