faster_rcnn_nas_coco ValueError setting --num_clones

austinmw commented 6 years ago

Not sure if this is a bug or not since I can run other models fine.

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): have added PR curves from here: https://github.com/tensorflow/models/issues/3081
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Nvidia-docker, Ubuntu 16.04
TensorFlow installed from (source or binary): Conda tensorflow-gpu
TensorFlow version (use command below): 1.8.0
CUDA/cuDNN version: 9.0/7
GPU model and memory: (4) GTX 1080

Exact command to reproduce:

run_train()
{
export CUDA_VISIBLE_DEVICES=0,1,2
python3 /home/ubuntu/training/train.py --logtostderr --pipeline_config_path=/home/ubuntu/training/faster_rcnn_nas_coco.config --train_dir=/home/ubuntu/training/models/train --num_clones=3 --ps_tasks=1
unset CUDA_VISIBLE_DEVICES
}

Describe the problem

I've previously tried two models: ssd_mobilenet_v1_coco_2017_11_17 and faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 which will both run like this:

run_train()
{
export CUDA_VISIBLE_DEVICES=0,1,2
python3 /home/ubuntu/training/train.py --logtostderr --pipeline_config_path=/home/ubuntu/training/faster_rcnn_nas_coco.config --train_dir=/home/ubuntu/training/models/train --num_clones=3 --ps_tasks=1
unset CUDA_VISIBLE_DEVICES
}

I have 4 GPUs, so I've been setting the first three to train and the last one to eval. However, for some reason I'm unable to do the same for the model faster_rcnn_nas_coco_2018_01_28. When I try to set --num_clones=3 I get the error:

WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
Traceback (most recent call last):
  File "/home/awelch/training/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/awelch/training/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/tensorflow/models/research/object_detection/trainer.py", line 285, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/tensorflow/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/tensorflow/models/research/object_detection/trainer.py", line 177, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

Could anyone please explain why this is or how I can fix it so that I can run this model with more than 1 GPU?

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Bazel version

ImEric commented 6 years ago

Hi @austinmw, changing the batch size in the training config from 1 to 1 * num_clones solved this problem for me. I believe it's due to line 265 in object_detection/trainer.py:

batch_size = train_config.batch_size // num_clones
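
To make the arithmetic concrete, here is a minimal sketch of that integer division. The helper name per_clone_batch_size is made up for illustration and is not part of the Object Detection API; only the floor division itself comes from the trainer.py line quoted above. With a configured batch size of 1 and --num_clones=3 the result is 0, which would be consistent with the "expected 7, got 0" unpack error, though I have not traced the rest of the code path.

```python
def per_clone_batch_size(total_batch_size: int, num_clones: int) -> int:
    """Hypothetical helper that mirrors the floor division quoted from trainer.py."""
    return total_batch_size // num_clones


if __name__ == "__main__":
    num_clones = 3
    # Compare the failing setting (1) with the suggested one (1 * num_clones).
    for total in (1, 1 * num_clones):
        per_clone = per_clone_batch_size(total, num_clones)
        status = "ok" if per_clone >= 1 else "empty batch per clone"
        print("batch_size=%d, num_clones=%d -> %d per clone (%s)"
              % (total, num_clones, per_clone, status))
```

In other words, the batch_size in the training config is the total across all clones, so it needs to be at least num_clones for every clone to get one or more images.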

jixiaonanzhuaizhuai commented 6 years ago

@ImEric Hi, I can run other models fine, but I get an error when I run faster_rcnn_nas_coco. Can you help me?

The error message:

  File "legacy/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "legacy/train.py", line 93, in main
    FLAGS.pipeline_config_path)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow/models-master/research/object_detection/utils/config_util.py", line 94, in get_configs_from_pipeline_file
    text_format.Merge(proto_str, pipeline_config)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 533, in Merge
    descriptor_pool=descriptor_pool)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 587, in MergeLines
    return parser.MergeLines(lines, message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 620, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 635, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 735, in _MergeField
    merger(tokenizer, message, field)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 823, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 703, in _MergeField
    (message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 143:1 : Message type "object_detection.protos.EvalConfig" has no field named "eval_input_reader".

my protoc --version : 3.5.1

my tensorflow : 1.6.0

tensorflow / models

faster_rcnn_nas_coco ValueError setting --num_clones #4504
