tensorflow / models

Models and examples built with TensorFlow

Multiple GPU in model_main.py (since there is no more train.py) #5421

Open waltermaldonado opened 5 years ago

waltermaldonado commented 5 years ago

Greetings,

I would like to know how to use both of my GPUs to train a Faster R-CNN with NASNet-A featurization model with the model_main.py file included in the object_detection tools, now that train.py is gone. If that is not possible, I would like to request this feature or a workaround to make it work.

Thanks in advance.

Burgomehl commented 5 years ago

You may find train.py in the "legacy" folder.

waltermaldonado commented 5 years ago

It doesn't work.

It fails with: not enough values to unpack (expected 7, got 0).

nigelmathes commented 5 years ago

train.py no longer works, returning the following error:

 File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
    train_config.use_multiclass_scores)
ValueError: not enough values to unpack (expected 7, got 0)

We need multiple GPU support for model_main.py if it's going to be the only way to use the Object Detection API.

varun19299 commented 5 years ago

I think you need to change the batch size in the config file. (The batch size here is not per GPU but the total across all GPUs: batch size = number of GPUs × batch size per GPU. For example, 2 GPUs with a per-GPU batch of 8 means batch_size: 16 in the config.)

> train.py no longer works, returning the following error:
>
>  File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
>     train_config.use_multiclass_scores)
> ValueError: not enough values to unpack (expected 7, got 0)
>
> We need multiple GPU support for model_main.py if it's going to be the only way to use the Object Detection API.

rickragv commented 5 years ago

Has anybody done multi-GPU training using train.py?

nigelmathes commented 5 years ago

> I think you need to change the batch size in the config file. (The batch size here is not per GPU but the total across all GPUs: batch size = number of GPUs × batch size per GPU.)
>
> train.py no longer works, returning the following error:
>
>  File "/tensorflow/models/research/object_detection/legacy/trainer.py", line 180, in _create_losses
>     train_config.use_multiclass_scores)
> ValueError: not enough values to unpack (expected 7, got 0)

I just tried changing the batch size to 2 and noticed that the second GPU was still not utilized. In fact, the one GPU spun up to ~80% usage in nvidia-smi (as expected), but then the whole run crashed with a core-dump memory error and the following stack trace:

INFO:tensorflow:loss = 3.4643404, step = 23013
I1009 13:08:23.581555 140509247522560 tf_logging.py:115] loss = 3.4643404, step = 23013
*** Error in `python': double free or corruption (fasttop): 0x00007fc3b0022e90 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fca520eb7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fca520f437a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fca520f853c]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow6Tensor16CopyFromInternalERKS0_RKNS_11TensorShapeE+0xbe)[0x7fca042dca9e]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow25NonMaxSuppressionV3V4Base7ComputeEPNS_15OpKernelContextE+0x7a)[0x7fca0796087a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow13TracingDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xbc)[0x7fca0443c37c]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x63a4bc)[0x7fca044804bc]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x63ae2a)[0x7fca04480e2a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x21a)[0x7fca044ee96a]
/root/.local/share/virtualenvs/virt-iytJWWKq/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x32)[0x7fca044eda12]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fc9fac6dc80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fca524456ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fca5217b41d]
yukw777 commented 5 years ago

I'm seeing the same thing as @nigelmathes

varun19299 commented 5 years ago

Set num_clones = 2


yukw777 commented 5 years ago

@varun19299 that option is only available for the old train.py, which has been deprecated and no longer works, as @waltermaldonado and @nigelmathes pointed out. We need an option for the new model_main.py script.

varun19299 commented 5 years ago

I don't think the new model_main.py supports multi-GPU.

This is probably because Estimator distribution strategies don't work with tf.contrib.slim or tf.contrib.layers.

Could one of the maintainers explain this?

Hafplo commented 5 years ago

@varun19299, @nealwu Until this is solved, could I use clusters (multi-node, single-GPU) with the new Estimator module?

pkulzc commented 5 years ago

As @varun19299 said, model_main.py doesn't support multi-GPU because Estimator distribution strategies don't work with tf.contrib.slim.

But you can still train with multiple GPUs via the legacy train.py. All configs work with both the new and the legacy binaries. Just note that you need to set replicas_to_aggregate in train_config appropriately.
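
For example, a typical legacy multi-GPU invocation looks like the following sketch (paths are illustrative; num_clones should match your GPU count):

python object_detection/legacy/train.py \
    --pipeline_config_path=path/to/pipeline.config \
    --train_dir=path/to/train_dir \
    --num_clones=2 \
    --ps_tasks=1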

pchowdhry commented 5 years ago

Hi @pkulzc, I'm trying to train an R-CNN using Cloud ML (Object Detection API), runtime version 1.9, and if I set the num_clones option with train.py, I get an unrecognized-option error. Is a specific runtime version necessary for train.py to work?

ppwwyyxx commented 5 years ago

My implementation of Faster R-CNN and Mask R-CNN supports multi-GPU and distributed training.

oopsodd commented 5 years ago

@pkulzc I tried to train a quantized model using the legacy train.py with multiple GPUs, but it doesn't seem to work. I get this when running the legacy eval.py: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint. Is training a quantized model with multiple GPUs supported?

toddwyl commented 5 years ago

> As @varun19299 said, model_main.py doesn't support multi-GPU because Estimator distribution strategies don't work with tf.contrib.slim.
>
> But you can still train with multiple GPUs via the legacy train.py. All configs work with both the new and the legacy binaries. Just note that you need to set replicas_to_aggregate in train_config appropriately.

@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?

samuel1208 commented 5 years ago

@oopsodd I met the same problem. Have you solved it?

FiveMaster commented 5 years ago

@pkulzc What is the meaning of replicas_to_aggregate? If I want to run SSD training on two GPUs, what value should I set replicas_to_aggregate to?

donghyeon commented 5 years ago

@pkulzc Then if I replace all the slim layers with tf.keras.layers, will model_main.py be able to run on multiple GPUs with ease, or would lots of other contributions be needed for multi-GPU training? If so, could you give me some useful keywords for studying multi-GPU training so I can contribute some code to this API?

varun19299 commented 5 years ago

IMO, the bigger incompatibility would be with slim.arg_scope.

donghyeon commented 5 years ago

@varun19299 There's already a Keras model in this API repository that doesn't use any slim code. See: https://github.com/tensorflow/models/blob/master/research/object_detection/models/keras_applications/mobilenet_v2.py. Have you tested multi-GPU training with this Keras model? It would be appreciated if you shared your observations about multi-GPU training with Estimators. Of course, I will also try this in a few days.

oopsodd commented 5 years ago

@samuel1208 sorry, I didn't. How about you?

v-qjqs commented 5 years ago

@donghyeon Have you tested multi-GPU training with the Keras model you mentioned above? Thanks very much. For multi-GPU training, I have tried removing everything related to slim.arg_scope and explicitly re-constructing/re-initializing the model network with tf.layers, but it didn't work with multiple GPUs. Did you succeed, or would lots of other effort be needed for multi-GPU training? Thanks.

Tantael commented 5 years ago

I think it would be nice to treat this as a priority -> training on one GPU is almost impossible.

lighTQ commented 5 years ago

I used the following script to train faster_rcnn_resnet50 on a single machine with multiple GPUs:

python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=2 \
    --num_clones=2 \
    --ps_tasks=1

nigelmathes commented 5 years ago

@lighTQ train.py is legacy at this point, and the API has moved away from it. In fact, on the most up-to-date version of this repository, train.py throws errors.

As far as I can tell, there is still no multi-GPU training in model_main.py.

v-qjqs commented 5 years ago

I have tried to re-construct the model with tf.layers, and I encountered the same issue as https://github.com/tensorflow/tensorflow/issues/23030 (my tf-gpu version: 1.12, Windows). Neither tf.contrib.layers.l2_regularizer nor tf.keras.regularizers.l2 worked with tf.layers.dense or a conv2d layer when using MirroredStrategy. But when I updated my tensorflow-gpu version from 1.12 to 1.13, it worked with OneDeviceStrategy, and I got another error with MirroredStrategy (since I use a Windows notebook with only one GPU card). So I guess model_main.py may work with multiple GPUs when using TF 1.13 and tf.layers (or tf.keras.layers)? Thanks.
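
For reference, a minimal sketch of the layer/regularizer combination I mean (toy shapes; on its own this graph builds fine, and the failure I described only appeared when the same construction ran per replica under MirroredStrategy):

import tensorflow as tf

# Toy placeholder input; shapes are illustrative.
x = tf.placeholder(tf.float32, shape=[None, 4])

# tf.layers plus an l2 regularizer: the combination that failed for me under
# MirroredStrategy on TF 1.12 (and worked with OneDeviceStrategy on 1.13).
y = tf.layers.dense(
    x, 1, kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-4))
reg_loss = tf.losses.get_regularization_loss()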

farzaa commented 5 years ago

Any updates on this?

Tantael commented 5 years ago

> Any updates on this?

I think we need to fix this on our own.

austinmw commented 5 years ago

Really important issue, IMO! Does anyone know how much progress has been made on this?

It seems like anybody who understands this API well enough to work on it isn't very concerned with prioritizing the issue, because they just train on TPUs instead! :( I could be misunderstanding the scope, but it's been broken for nearly eight months now.

Hafplo commented 5 years ago

My guess is the team is working full time on migrating to TF 2.0 (which won't have slim anyway; not sure about Estimator). @pkulzc, any shout-out from the owners would be appreciated so we know what to expect.

pkulzc commented 5 years ago

Actually @Hafplo got us right. We are collaborating with a number of internal teams to migrate to TF 2.0. We have made some progress on migrating a few models to Keras; multi-GPU training will come after that. This is a huge effort because we need to have a complete migration solution before asking users to migrate. Please allow us some more time, and I'll keep you updated!

netanel-s commented 5 years ago

@pkulzc, thank you very much for the update! Could you by any chance estimate the time frame in which multi-GPU training will be available? Is it more like 1 month / 3 months / half a year? Thanks a lot, looking forward to it eagerly! :)

cedricve commented 5 years ago

I'd like to have this as well.

CasiaFan commented 5 years ago

It's quite easy to manually modify the model_main.py file to support multi-GPU training in TF 1.13.1. What we need to do is add a distribute-training strategy to the estimator's RunConfig when defining the estimator. Specifically, if there are 2 GPUs available, we need to modify this line of code to the following:

strategy = tf.contrib.distribute.MirroredStrategy(devices=["/device:GPU:0", "/device:GPU:1"])
# or optionally
# strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,  train_distribute=strategy)
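
To see the mechanics in isolation, here is a self-contained toy sketch (TF 1.13.x and two GPUs assumed; the model and data are stand-ins for illustration, not part of the Object Detection API):

import numpy as np
import tensorflow as tf

def input_fn():
    # Toy data standing in for the object-detection input pipeline.
    x = np.random.rand(64, 4).astype(np.float32)
    y = np.random.rand(64, 1).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(8)

def model_fn(features, labels, mode):
    # Minimal stand-in model; MirroredStrategy replicates it per GPU.
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

strategy = tf.contrib.distribute.MirroredStrategy(
    devices=["/device:GPU:0", "/device:GPU:1"])
config = tf.estimator.RunConfig(model_dir="/tmp/mirrored_toy",
                                train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn, steps=10)
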
austinmw commented 5 years ago

@CasiaFan Have you tested this? I thought that, as pkulzc said, "Estimator distribution strategies don’t work with tf.contrib.slim"?

waltermaldonado commented 5 years ago

@CasiaFan, we need confirmation that what you proposed really works, because as stated by @pkulzc, it should not be possible:

> As @varun19299 said, model_main.py doesn't support multi-GPU because Estimator distribution strategies don't work with tf.contrib.slim.
>
> But you can still train with multiple GPUs via the legacy train.py. All configs work with both the new and the legacy binaries. Just note that you need to set replicas_to_aggregate in train_config appropriately.

Thanks for helping us out, and just to save us a little time: have you tested it? If so, could you provide a little more detail (which model you used, batch configs, etc.)?

varun19299 commented 5 years ago

I believe this was possible even in 1.10.

The bottleneck, IMO, is the slim layers, which would have to be replaced, and the scopes.


CasiaFan commented 5 years ago

@waltermaldonado I have tested it using the official ssd_mobilenet_v2_coco.config from the config directory. The python3 process with PID 38956 on GPUs 3 and 4 is my training process; I set batch_size to 4 just for testing.

[screenshot: nvidia-smi output showing the training process running on both GPUs]

So it seems that the slim module can work under tf.contrib.distribute.MirroredStrategy. BTW, I wrote a MobileNet V3 backbone using tf.keras, which works fine with this strategy. However, if I use the off-the-shelf ssd_mobilenet_v2_keras, an error occurs: the input_shape received by the convolutional_keras_box_predictor.py build function at line 135 is None. I'm working on this error now.

CasiaFan commented 5 years ago

UPDATE: as for the problem I mentioned above when using ssd_mobilenet_v2_keras as the feature extractor,

 File "/home/arkenstone/models/research/object_detection/predictors/convolutional_keras_box_predictor.py", line 135, in build
    if len(input_shapes) != len(self._prediction_heads[BOX_ENCODINGS]):
TypeError: object of type 'NoneType' has no len()

it seems to be due to the incompatibility between Python 2 and Python 3 in what dict.values() returns. Just add feature_maps = list(feature_maps) after the feature maps are extracted at line 570 of ssd_meta_arch.py.
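
For the curious, a toy illustration of the Python 2/3 difference involved (the dict contents here are just for flavor):

d = {"box_encodings": [1, 2], "class_predictions_with_background": [3]}

vals = d.values()   # Python 3: a dict_values view; Python 2: a plain list
vals = list(vals)   # materializing restores the list behavior that code
                    # written for Python 2 expects downstream
print(vals)         # [[1, 2], [3]]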

paillardf commented 5 years ago

@CasiaFan I can't make model_main.py work with MirroredStrategy. I am using TensorFlow 1.13.1 with ssd_mobilenet_v2_coco. I get this error at start:

Traceback (most recent call last):
  File "model_main.py", line 112, in <module>
    tf.app.run()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 108, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
    self.config))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 641, in _call_for_each_replica
    return _call_for_each_replica(self._container_strategy(), fn, args, kwargs)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
    coord.join(threads)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/prog/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 167, in _call_for_each_replica
    merge_args = values.regroup({t.device: t.merge_args for t in threads})
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in regroup
    for i in range(len(v0)))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 997, in <genexpr>
    for i in range(len(v0)))
  File "/home/prog/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/values.py", line 1010, in regroup
    assert set(v.keys()) == v0keys
AssertionError

Any ideas? Thank you.

CasiaFan commented 5 years ago

@paillardf Hmm... it seems some nodes in the graph are different on each device. Did you add operations on a specific device, or are you just using the default Object Detection API?

paillardf commented 5 years ago

@CasiaFan I didn't change anything in the object_detection folder except the line you gave us. I am up to date with the models repo as well.

ChanZou commented 5 years ago

@CasiaFan Would you please post the commit hash you cloned / last merged from the repo? I am running TF 1.13.1 but getting a different error message: ValueError: Variable FeatureExtractor/MobilenetV2/Conv/weights/replica_1/ExponentialMovingAverage/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope? The error persists across both Keras and slim models, so I think your commit hash would be very helpful for both @paillardf and myself. Thanks.

laksgreen commented 5 years ago

> I used the following script to train faster_rcnn_resnet50 on a single machine with multiple GPUs:
>
> python object_detection/legacy/train.py \
>     --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
>     --train_dir=/root/research/Datasets/bak/model/myTrain2 \
>     --worker_replicas=2 \
>     --num_clones=2 \
>     --ps_tasks=1

@lighTQ - Thanks, it's working with the options "--worker_replicas=2 --num_clones=2 --ps_tasks=1".

waltermaldonado commented 5 years ago

> I used the following script to train faster_rcnn_resnet50 on a single machine with multiple GPUs:
>
> python object_detection/legacy/train.py \
>     --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
>     --train_dir=/root/research/Datasets/bak/model/myTrain2 \
>     --worker_replicas=2 \
>     --num_clones=2 \
>     --ps_tasks=1
>
> @lighTQ - Thanks, it's working with the options "--worker_replicas=2 --num_clones=2 --ps_tasks=1".

Yes, this is the way to do it with the legacy train scripts. We just can't train with multiple GPUs using model_main.py.

CasiaFan commented 5 years ago

@ChanZou @paillardf Sorry for the late response. This issue seems to be related to the combined use of tf.contrib.distribute.MirroredStrategy and tf.train.ExponentialMovingAverage; see #27392 for details. For simplicity, I turned off use_moving_average during training, like this in the config:

train_config: {
  batch_size: 2
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
    use_moving_average: false  # add this line
  }
}

Now I have upgraded my TF to version 1.14. Adding only the following code causes problems:

gpu_devices = ["/device:GPU:{}".format(x) for x in range(len(FLAGS.gpus.split(",")))]
strategy = tf.distribute.MirroredStrategy(devices=gpu_devices, 
                                            cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(num_packs=1))
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
                                  train_distribute=strategy,
                                  save_checkpoints_steps=1500)

The error:

...
  File "/data/fanzong/miniconda3/envs/tf_cuda10/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 126, in _require_cross_replica_or_default_context_extended
    raise RuntimeError("Method requires being in cross-replica context, use "
RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()

This error seems to be caused by the scaffold defining a saver for the EstimatorSpec. When that snippet of code (lines 501-509) is commented out, training with model_main.py works fine! I will dig into this problem later. But since I have added a lot of custom code, I'm sorry I cannot push a commit that would work under TF 1.13.1. The previous experiment was done on a freshly cloned branch, and I'd appreciate it if you could test it. BTW, as for the slim concern mentioned in previous posts, I have checked the Object Detection API project roughly (both the TF 1.13.1 and TF 1.14.0 versions). Only a small amount of code still uses slim, and most of it is slim nets defining network architectures; all other modules have counterparts written in tf.keras. So I don't think slim is the obstacle to multi-GPU training.

ChanZou commented 5 years ago

@CasiaFan Thank you for sharing, it is super helpful! I made the same discovery and followed #27392 to modify the TF code. To add a little more info: TowerOptimizer and replicate_model_fn are tempting, but using them in model_main.py can be really dangerous. The obvious reason is that they were deprecated more than a year ago; what makes them more dangerous is that the code might still run. However, they use variable_scope to handle variable reuse, and tf.keras seems not to be affected by that. So instead of having one replica of the model, I had 4 replicas (one per GPU) trained at the same time, which led to massive checkpoints and degraded model performance.

harsh306 commented 4 years ago

> I used the following script to train faster_rcnn_resnet50 on a single machine with multiple GPUs: python object_detection/legacy/train.py --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config --train_dir=/root/research/Datasets/bak/model/myTrain2 --worker_replicas=2 --num_clones=2 --ps_tasks=1

@lighTQ, @pkulzc if we are scaling on a single machine with 2 GPUs, should we keep --worker_replicas=1?

python object_detection/legacy/train.py \
    --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config \
    --train_dir=/root/research/Datasets/bak/model/myTrain2 \
    --worker_replicas=1 \
    --num_clones=2 \
    --ps_tasks=1
kjkim-kr commented 4 years ago

@CasiaFan Thanks to your comments, I was able to get model_main.py training on multiple GPUs. Thanks a lot. But when comparing the two ways of multi-GPU training,

1) legacy/train.py --num_clones=2 --ps_tasks=1
2) model_main.py + tf.contrib.distribute.MirroredStrategy(num_gpus=2)

with the same model (faster-rcnn-resnet101 and its configs), I found some differences.

Q1) When I set batch_size = 1 with 2 GPUs, 1) threw an error (ValueError: not enough values to unpack (expected 7, got 0)) but 2) worked fine.

Q2) When I set batch_size = 8 with 1 GPU, both 1) and 2) worked fine. But when I increased batch_size to 16 and used 2 GPUs, 1) worked fine while 2) threw an OOM error. (I use 8 RTX 2080 Ti GPUs with 11 GB each, TensorFlow 1.14.0.)

So I guess that under MirroredStrategy, batch_size is not treated as (batch per GPU × number of GPUs). From these results, I think the training result of (batch_size=16, 2 GPUs with legacy/train.py) is the same as (batch_size=8, 2 GPUs with model_main.py + MirroredStrategy). Did you check this point? If you did, or if you know about this issue, would you let me know?

Thanks.