minda163 closed this issue 5 years ago
I think this might be because of compatibility issues. Try training with model_main.py. It's almost identical input-wise.
I am trying to train this on a single P100 (16 GB) with the new model_main.py script.
OOM is raised unless I lower the batch size to 8. I have tried both replicas_to_aggregate: 1 and sync_replicas: false, together and separately:
batch_size: 8
sync_replicas: false
startup_delay_steps: 0
replicas_to_aggregate: 1
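For reference, these four lines sit inside the train_config block of the pipeline .config file (protobuf text format). A minimal sketch of how the fragment is embedded, with every other field left exactly as shipped in the stock config:

train_config: {
  batch_size: 8
  sync_replicas: false
  startup_delay_steps: 0
  replicas_to_aggregate: 1
  # optimizer, fine_tune_checkpoint, num_steps, data_augmentation_options
  # and the rest of the stock settings stay unchanged
}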
Here is the command I'm using:
export PIPELINE_CONFIG_FILE=ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_batch_size_8.config
export PROJ_ROOT=$HOME/src/nn/tf-retinanet
export MODEL_DIR=${PROJ_ROOT}/results/retinanet-l1s-batchsize-8/
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
export NUM_TRAIN_STEPS=50000
pipenv run python models/research/object_detection/model_main.py \
--pipeline_config_path=${PROJ_ROOT}/config/${PIPELINE_CONFIG_FILE} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
Am I missing something here?
@nathantsoi sync_replicas and replicas_to_aggregate are not really used by model_main.py (they are used by the deprecated train.py). You get this OOM simply because you don't have that much memory.
That config by default is expecting 8 replicas. Set this to 1 and re-try it.
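One quick way to confirm that it is GPU memory (rather than host RAM) being exhausted is to watch the card from a second shell while the first training steps run; assuming the NVIDIA driver utilities are installed:

watch -n 1 nvidia-smi

If the memory column climbs toward the full 16 GB right before the crash, lowering batch_size in train_config is the setting that actually reduces per-step memory for model_main.py.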
Thanks, it works.
System information
Describe the problem
Running the following command fails with a ValueError (full log below):

python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config
Source code / logs
(base) D:\tensorflow3\models\research\object_detection>python legacy/train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config
WARNING:tensorflow:From F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py:126: main (from __main__) is deprecated and will be removed in a future version.
Instructions for updating:
Use object_detection/model_main.py.
W1005 21:39:48.133514 17976 tf_logging.py:126] From F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py:126: main (from __main__) is deprecated and will be removed in a future version.
Instructions for updating:
Use object_detection/model_main.py.
WARNING:tensorflow:From D:\tensorflow3\models\research\object_detection\legacy\trainer.py:265: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
W1005 21:39:48.211594 17976 tf_logging.py:126] From D:\tensorflow3\models\research\object_detection\legacy\trainer.py:265: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1005 21:39:48.227216 17976 tf_logging.py:126] num_readers has been reduced to 1 to match input file shards.
Traceback (most recent call last):
  File "legacy/train.py", line 184, in <module>
    tf.app.run()
  File "F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "F:\my_install\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "legacy/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "D:\tensorflow3\models\research\object_detection\legacy\trainer.py", line 290, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "D:\tensorflow3\models\research\slim\deployment\model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "D:\tensorflow3\models\research\object_detection\legacy\trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)
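The deprecation warning in this log already names the replacement entry point, and the comment above recommends the same switch. A sketch of the equivalent model_main.py invocation for this setup, run from the same object_detection directory (the step count mirrors the earlier command in this thread and is only a placeholder, not a value verified on this machine):

python model_main.py --pipeline_config_path=training/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config --model_dir=training/ --num_train_steps=50000 --alsologtostderr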