minda163 closed this issue 5 years ago
I think this might be because of compatibility issues. Try training with model_main.py. It's almost identical input-wise.
I am trying to train this on a single P100 (16 GB) with the new model_main.py script.
OOM is raised unless I lower the batch size to 8. I have tried both replicas_to_aggregate: 1 and sync_replicas: false, together and separately:
batch_size: 8
sync_replicas: false
startup_delay_steps: 0
replicas_to_aggregate: 1
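For reference, these four lines sit inside the train_config block of the pipeline .config file (protobuf text format). A minimal sketch of how the fragment is embedded, with every other field left exactly as shipped in the stock config:

train_config: {
  batch_size: 8
  sync_replicas: false
  startup_delay_steps: 0
  replicas_to_aggregate: 1
  # optimizer, fine_tune_checkpoint, num_steps, data_augmentation_options
  # and the rest of the stock settings stay unchanged
}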
Here is the command I'm using:
export PIPELINE_CONFIG_FILE=ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_batch_size_8.config
export PROJ_ROOT=$HOME/src/nn/tf-retinanet
export MODEL_DIR=${PROJ_ROOT}/results/retinanet-l1s-batchsize-8/
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
export NUM_TRAIN_STEPS=50000
pipenv run python models/research/object_detection/model_main.py \
--pipeline_config_path=${PROJ_ROOT}/config/${PIPELINE_CONFIG_FILE} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
Am I missing something here?
@nathantsoi sync_replicas and replicas_to_aggregate are not really used by model_main.py (they are used by the deprecated train.py). You get this OOM simply because you don't have that much memory.
That config by default is expecting 8 replicas. Set this to 1 and re-try it.
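One quick way to confirm that it is GPU memory (rather than host RAM) being exhausted is to watch the card from a second shell while the first training steps run; assuming the NVIDIA driver utilities are installed:

watch -n 1 nvidia-smi

If the memory column climbs toward the full 16 GB right before the crash, lowering batch_size in train_config is the setting that actually reduces per-step memory for model_main.py.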
Thanks, it works.
System information
Describe the problem
Running the following command fails with a ValueError (full log below):

python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config
Source code / logs
(base) D:\tensorflow3\models\research\object_detection>python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config
WARNING:tensorflow:From F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py:126: main (from __main__) is deprecated and will be removed in a future version.
Instructions for updating:
Use object_detection/model_main.py.
W1005 21:39:48.133514 17976 tf_logging.py:126] From F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py:126: main (from __main__) is deprecated and will be removed in a future version.
Instructions for updating:
Use object_detection/model_main.py.
WARNING:tensorflow:From D:\tensorflow3\models\research\object_detection\legacy\trainer.py:265: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
W1005 21:39:48.211594 17976 tf_logging.py:126] From D:\tensorflow3\models\research\object_detection\legacy\trainer.py:265: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1005 21:39:48.227216 17976 tf_logging.py:126] num_readers has been reduced to 1 to match input file shards.
Traceback (most recent call last):
  File "legacy/train.py", line 184, in <module>
    tf.app.run()
  File "F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "legacy/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "D:\tensorflow3\models\research\object_detection\legacy\trainer.py", line 290, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "D:\tensorflow3\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "D:\tensorflow3\models\research\object_detection\legacy\trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)
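The deprecation warning in this log already names the replacement entry point, and the comment above recommends the same switch. A sketch of the equivalent model_main.py invocation for this setup, run from the same object_detection directory (the step count mirrors the earlier command in this thread and is only a placeholder, not a value verified on this machine):

python model_main.py --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config --model_dir=training/ --num_train_steps=50000 --alsologtostderr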