tensorflow / models

Models and examples built with TensorFlow

resume training for object detection #5213

Closed junweima closed 6 years ago

junweima commented 6 years ago

I'm training an SSDLite model with MobileNetV2 on the COCO dataset, with everything left at the defaults. The training was interrupted and killed, and now I want to restore the model and resume training.

However, the current script does not seem to support this. I tried setting fine_tune_checkpoint_type: "detection". The model is loaded, but the script only runs evaluation without training. Is there a way to resume training?

I've seen https://github.com/tensorflow/models/issues/4116, but I'm not sure which part was modified there, and train.py seems to have been moved to the legacy folder since then.

Specifically, I just want to ask:

  1. Is there a feature for resuming training? If so, which flags should I set?
  2. If there is no such feature, could you give some pointers on how to add it? I can make a PR if needed.

Thanks.

junweima commented 6 years ago

Hi, I solved it by passing a larger value (or None) for NUM_TRAIN_STEPS and NUM_EVAL_STEPS. I didn't know that train_steps is passed as max_steps in the function create_train_and_eval_specs in model_lib.py:

train_spec = tf.estimator.TrainSpec(
    input_fn=train_input_fn, max_steps=train_steps)
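
For anyone hitting the same thing, here is a rough sketch of why a larger value helps. As far as I understand it, max_steps in tf.estimator is a limit on the total global step, not on the number of steps in the current run, so a run restored from a checkpoint that has already reached that step has nothing left to train. The helper below is hypothetical and only mimics that counting in plain Python. It is not part of TensorFlow or the Object Detection API, and the step numbers are made up for illustration.

```python
from typing import Optional


def remaining_train_steps(global_step: int, max_steps: Optional[int]) -> Optional[int]:
    """Return how many more steps would run; None means no upper bound.

    Hypothetical helper that mimics the cumulative meaning of max_steps.
    It is not part of TensorFlow or the Object Detection API.
    """
    if max_steps is None:
        return None  # no limit: training continues until it is stopped
    # max_steps counts from step 0, so steps already taken are subtracted.
    return max(0, max_steps - global_step)


# Assumed numbers for illustration: the checkpoint was saved at step 50000.
restored_step = 50000

print(remaining_train_steps(restored_step, 50000))   # limit already reached
print(remaining_train_steps(restored_step, 200000))  # larger limit, training resumes
print(remaining_train_steps(restored_step, None))    # no limit at all
```

So when resuming from the same model directory, the train-steps value needs to be larger than the step stored in the latest checkpoint (or None), otherwise the training part exits right away and only evaluation runs.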