tensorflow / models

Models and examples built with TensorFlow

Multiple GPU in model_main.py (since there is no more train.py) #5421

Open waltermaldonado opened 5 years ago

waltermaldonado commented 5 years ago

Greetings,

I would like to know how to use both of my GPUs for training a Faster R-CNN with NASNet-A featurization model with the model_main.py file included in the object_detection tools, now that train.py is gone. If it is not possible, I would like to request this feature or a workaround to make it work.

Thanks in advance.

CasiaFan commented 5 years ago

@kjkim-kr Thanks for reporting this case. I didn't try legacy/train.py, but I checked trainer.py and found that it allocates the data queue and training process to each GPU using tf.device directly, which should be more efficient since it is a fairly fundamental API. Have you tried setting a smaller batch size in both modes to check whether the MirroredStrategy mode occupies more memory?
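For context, the MirroredStrategy mode discussed here is typically wired into model_main.py through the Estimator's RunConfig. A minimal TF 1.x sketch, with the model_fn and input pipeline omitted and num_gpus=2 assumed for a two-GPU machine:

```python
import tensorflow as tf

# Mirror the model across two GPUs via the Estimator configuration.
# tf.contrib.distribute.MirroredStrategy is the TF 1.x location of this API.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
run_config = tf.estimator.RunConfig(train_distribute=strategy)

# The strategy-aware config is then passed to the Estimator, e.g.:
# estimator = tf.estimator.Estimator(model_fn=..., config=run_config)
```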

sp-ananth commented 4 years ago

Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.

Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

@pkulzc

harsh306 commented 4 years ago

> Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable.
>
> Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?

Yes, I switched to legacy/train and it worked.

sp-ananth commented 4 years ago

> Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable. Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?
>
> Yes, I switched to legacy/train and it worked.

Yes, as I said above, I was asking whether there is a fix for multi-GPU training with model_main.py, not legacy/train.py.

model_main.py is the more recent release that allows evaluation on the fly, and it should ideally support multi-GPU for an object detection task.

shiragit commented 4 years ago

> Hello there, are there any updates on this? We could switch over to the legacy train.py, but model_main.py does evaluation on the test set during training and is definitely preferable. Also, can we get an update on whether the solution provided by @CasiaFan (thanks!) is an acceptable one?
>
> Yes, I switched to legacy/train and it worked.

@harsh306, have you managed to run eval.py after training with num_clones > 1?

Adblu commented 4 years ago

@royshil That solution seems to be extremely slow, and it does not utilize the full power of all GPUs. Any newer updates from this year?

qraleq commented 3 years ago

Hi, did anyone manage to get model_main.py working in a multi-GPU setting in an efficient manner (not slower, and utilizing all the GPUs)?

davitv commented 3 years ago

Hi! Has anyone achieved running model_main in a multi-GPU environment? Thanks to @CasiaFan's answer I was able to run it, but it is extremely slow (nvidia-smi shows that all GPUs are used, but the training actually looks like it freezes at step 0).

sainisanjay commented 3 years ago

> --worker_replicas=2 --num_clones=2 --ps_tasks=1

@laksgreen @lighTQ Could you please explain the difference between --worker_replicas=2, --num_clones=2, and --ps_tasks=1? I can start multi-GPU training using train.py, but each of my GPUs is only 30-40% utilized. How can I increase GPU utilization? I have 8 GPUs in a single machine.

sainisanjay commented 3 years ago

> I used the following script to complete single-machine multi-GPU training of fast_rcnn_resnet50: python object_detection/legacy/train.py --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config --train_dir=/root/research/Datasets/bak/model/myTrain2 --worker_replicas=2 --num_clones=2 --ps_tasks=1

I have done multi-GPU training using the above command, and training completed successfully. Keep in mind that after training, when you export the model checkpoints using object_detection/export_inference_graph.py, you will get an error like issue https://github.com/tensorflow/models/issues/5625. This is because when you use multiple GPUs for training, variable names change in the graph (each node or variable is prefixed, e.g. clone_1/nodeName, and similarly for the others). To solve this, you have to remove the clone or clone_1 prefix from the graph and then export the checkpoints. I wrote a Python script to remove these extra clone and clone_1 prefixes.

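For illustration, here is what the rename described above does to a variable name; the scope names below are hypothetical, not taken from an actual checkpoint:

```python
# Variables trained on clone 1 carry a 'clone_1/' prefix that
# export_inference_graph.py does not expect:
old_name = 'clone_1/FirstStageFeatureExtractor/conv1/weights'
new_name = old_name[len('clone_1/'):]
# -> 'FirstStageFeatureExtractor/conv1/weights'
```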

Source82 commented 3 years ago

> I used the following script to complete single-machine multi-GPU training of fast_rcnn_resnet50: python object_detection/legacy/train.py --pipeline_config_path=/root/research/Datasets/bak/model/pipeline.config --train_dir=/root/research/Datasets/bak/model/myTrain2 --worker_replicas=2 --num_clones=2 --ps_tasks=1
>
> I have done multi-GPU training using the above command. [...] I wrote a Python script to remove these extra clone and clone_1 prefixes.

Could you please share the code you have written to tackle the clone issue? Thanks.

sainisanjay commented 3 years ago

@Source82

```python
import sys, getopt
import tensorflow as tf  # TF 1.x (uses tf.contrib and tf.Session)

usage_str = 'python tensorflow_rename_variables.py --checkpoint_dir=path/to/dir/ --dry_run'

def rename(checkpoint_dir, dry_run):
    checkpoint = tf.train.get_checkpoint_state(checkpoint_dir)
    with tf.Session() as sess:
        for var_name, _ in tf.contrib.framework.list_variables(checkpoint_dir):
            var = tf.contrib.framework.load_variable(checkpoint_dir, var_name)
            # Locate the 'clone_N/' prefix added by multi-GPU training.
            pos1 = var_name.find('clone_')
            pos2 = var_name.find('/clone_')
            posf = len(var_name)
            # Set the new name: strip the clone prefix if present.
            new_name = var_name
            if (pos1 != -1) and (pos2 != -1):
                # '/clone_N/' appears after a scope: keep everything after it.
                new_name = var_name[pos2 + 9:posf]
            if (pos1 == 0) and (pos2 == -1):
                # Name starts with 'clone_N/': drop the first 8 characters.
                new_name = var_name[8:posf]
            if dry_run:
                print('%s would be renamed to %s.' % (var_name, new_name))
            else:
                print('Renaming %s to %s.' % (var_name, new_name))
                # Recreate the variable under its new name.
                var = tf.Variable(var, name=new_name)

        if not dry_run:
            # Save the renamed variables back to the checkpoint.
            # This must happen inside the session, before it is closed.
            saver = tf.train.Saver()
            sess.run(tf.global_variables_initializer())
            saver.save(sess, checkpoint.model_checkpoint_path)

def main(argv):
    checkpoint_dir = None
    dry_run = False

    try:
        # '--help' and '--dry_run' are flags and take no argument.
        opts, args = getopt.getopt(argv, 'h', ['help', 'checkpoint_dir=', 'dry_run'])
    except getopt.GetoptError:
        print(usage_str)
        sys.exit(2)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print(usage_str)
            sys.exit()
        elif opt == '--checkpoint_dir':
            checkpoint_dir = arg
        elif opt == '--dry_run':
            dry_run = True

    if not checkpoint_dir:
        print('Please specify a checkpoint_dir. Usage:')
        print(usage_str)
        sys.exit(2)
    rename(checkpoint_dir, dry_run)

if __name__ == '__main__':
    main(sys.argv[1:])
```
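
Usage, following the usage_str defined in the script (the checkpoint path is a placeholder):

```
# Preview the renames without modifying the checkpoint:
python tensorflow_rename_variables.py --checkpoint_dir=path/to/dir/ --dry_run

# Apply the renames in place:
python tensorflow_rename_variables.py --checkpoint_dir=path/to/dir/
```
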
Source82 commented 3 years ago

Thanks