tensorflow / models

Models and examples built with TensorFlow

faster_rcnn_nas_coco ValueError setting --num_clones #4504

Closed austinmw closed 6 years ago

austinmw commented 6 years ago

Not sure if this is a bug or not since I can run other models fine.

System information

I've previously trained two models, ssd_mobilenet_v1_coco_2017_11_17 and faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28, and both run fine like this:

run_train()
{
    export CUDA_VISIBLE_DEVICES=0,1,2
    python3 /home/ubuntu/training/train.py \
        --logtostderr \
        --pipeline_config_path=/home/ubuntu/training/faster_rcnn_nas_coco.config \
        --train_dir=/home/ubuntu/training/models/train \
        --num_clones=3 \
        --ps_tasks=1
    unset CUDA_VISIBLE_DEVICES
}

I have 4 GPUs, so I've been using the first three for training and the last one for eval. However, for some reason I'm unable to do the same with the model faster_rcnn_nas_coco_2018_01_28. When I set --num_clones=3 I get this error:

WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
Traceback (most recent call last):
  File "/home/awelch/training/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/awelch/training/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/tensorflow/models/research/object_detection/trainer.py", line 285, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/tensorflow/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/tensorflow/models/research/object_detection/trainer.py", line 177, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

Could anyone please explain why this is or how I can fix it so that I can run this model with more than 1 GPU?

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

What is the top-level directory of the model you are using
Bazel version

ImEric commented 6 years ago

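Hi @austinmw, changing the batch size in the training config from 1 to 1*num_clones solved this problem for me. I believe it's due to line 265 in object_detection/trainer.py:

batch_size = train_config.batch_size // num_clones

With batch_size 1 and num_clones 3, the integer division gives 0, so each clone ends up with an empty batch and there is nothing to unpack.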
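To make the arithmetic concrete, here is a minimal sketch in plain Python (no TensorFlow needed). Only the integer division comes from the line quoted above; the helper names and the 7-field tuples are hypothetical stand-ins for what _create_losses unpacks, so treat this as an illustration of the failure mode rather than the actual trainer code.

```python
def per_clone_batch_size(total_batch_size, num_clones):
    """Same integer division as the trainer.py line quoted above."""
    return total_batch_size // num_clones


def unpack_clone_inputs(per_image_fields):
    """Unpack a list of 7-field tuples, one tuple per image in the batch.

    Hypothetical stand-in for the 7-way unpacking in _create_losses;
    the real field names are not shown in this thread.
    """
    f1, f2, f3, f4, f5, f6, f7 = zip(*per_image_fields)
    return f1, f2, f3, f4, f5, f6, f7


if __name__ == "__main__":
    for total in (1, 3):
        size = per_clone_batch_size(total, num_clones=3)
        # Build a fake batch with `size` images, each carrying 7 fields.
        batch = [tuple(range(7))] * size
        try:
            unpack_clone_inputs(batch)
            print("batch_size=%d: per-clone size %d, unpack ok" % (total, size))
        except ValueError as err:
            print("batch_size=%d: per-clone size %d, %s" % (total, size, err))
```

So with --num_clones=3, the batch_size in train_config should be at least 3 (and preferably a multiple of 3) so that every clone gets at least one image.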

jixiaonanzhuaizhuai commented 6 years ago

@ImEric Hi, I can run other models fine, but I get an error when I run faster_rcnn_nas_coco. Can you help me?

The error message:

  File "legacy/train.py", line 184, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 250, in new_func
    return func(*args, **kwargs)
  File "legacy/train.py", line 93, in main
    FLAGS.pipeline_config_path)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow/models-master/research/object_detection/utils/config_util.py", line 94, in get_configs_from_pipeline_file
    text_format.Merge(proto_str, pipeline_config)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 533, in Merge
    descriptor_pool=descriptor_pool)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 587, in MergeLines
    return parser.MergeLines(lines, message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 620, in MergeLines
    self._ParseOrMerge(lines, message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 635, in _ParseOrMerge
    self._MergeField(tokenizer, message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 735, in _MergeField
    merger(tokenizer, message, field)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 823, in _MergeMessageField
    self._MergeField(tokenizer, sub_message)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/google/protobuf/text_format.py", line 703, in _MergeField
    (message_descriptor.full_name, name))
google.protobuf.text_format.ParseError: 143:1 : Message type "object_detection.protos.EvalConfig" has no field named "eval_input_reader".

my protoc --version : 3.5.1

my tensorflow : 1.6.0

my GPU :

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0  On |                  N/A |
| 51%   84C    P2   156W / 250W |  10539MiB / 11170MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 46%   68C    P8    22W / 250W |  10786MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 62%   85C    P2   102W / 250W |  10788MiB / 11172MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 30%   49C    P8    18W / 250W |  10592MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1574      G   /usr/lib/xorg/Xorg                            84MiB |
|    0     28929      C   python                                     10343MiB |
|    0     37191      G   compiz                                        95MiB |
|    0     46611      G   /opt/teamviewer/tv_bin/TeamViewer             11MiB |
|    1     18207      C   python3                                    10775MiB |
|    2     20236      C   python3                                    10775MiB |
|    3     16837      C   python                                     10579MiB |
+-----------------------------------------------------------------------------+
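
A note on the ParseError above: the traceback goes through _MergeMessageField, so the parser was still inside the eval_config block when it reached eval_input_reader at line 143, column 1 of the pipeline config. One possible cause, which nothing in this thread confirms, is a missing closing brace before that line. The sketch below is a rough way to check this. The function name is hypothetical, and it simply counts braces, ignoring everything after a '#', so it would miscount braces or '#' characters that appear inside quoted strings.

```python
def brace_depth_before_line(config_text, line_number):
    """Return how many '{' are still open just before the given 1-based line.

    Rough check only: text after '#' is treated as a comment, and braces
    inside quoted strings are not handled specially.
    """
    depth = 0
    for line in config_text.splitlines()[: line_number - 1]:
        code = line.split("#", 1)[0]
        depth += code.count("{") - code.count("}")
    return depth


if __name__ == "__main__":
    import sys

    # Usage: python check_braces.py <pipeline config path> <line number>
    path, line_number = sys.argv[1], int(sys.argv[2])
    with open(path) as f:
        print(brace_depth_before_line(f.read(), line_number))
```

A top-level field such as eval_input_reader should start at depth 0. If the script prints 1 or more for line 143, an earlier block in the config is probably not closed.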