Problem training on multiple GPUs

irnwritshin commented 7 years ago

Hi, excellent project, I had fun reading your code.

I did some experiments, reconfiguring the code and the pretrained COCO weights to work with Python 2.7 and a 2-class problem. I also disabled multiprocessing as in #13 because I don't have root access to the shared memory on the machine.

It worked fine when I was training on one GPU, but when I tried to run on multiple GPUs I ran into this problem:

Traceback (most recent call last):
  File "coco_multi.py", line 266, in <module>
    model.load_weights(model_path, by_name=True, exclude = ['mrcnn_class_logits','mrcnn_bbox_fc','mrcnn_mask'])
  File "/home/wenyao/project/Mask_RCNN/model.py", line 2014, in load_weights
    topology.load_weights_from_hdf5_group_by_name(f, layers)
  File "/usr/local/lib//python27/lib/python2.7/site-packages/keras/engine/topology.py", line 3158, in load_weights_from_hdf5_group_by_name
    K.batch_set_value(weight_value_tuples)
  File "/usr/local/lib//python27/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2193, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/usr/local/lib//python27/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 167, in get_session
    _initialize_variables()
  File "/usr/local/lib//python27/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 341, in _initialize_variables
    sess.run(tf.variables_initializer(uninitialized_variables))
  File "/usr/local/lib//python27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib//python27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib//python27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib//python27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'tower_1/mask_rcnn/strided_slice_13': Could not satisfy explicit device specification '/device:GPU:1' because no supported kernel for GPU devices is available.
         [[Node: tower_1/mask_rcnn/strided_slice_13 = StridedSlice[Index=DT_INT32, T=DT_INT64, begin_mask=1, ellipsis_mask=0, end_mask=1, new_axis_mask=0, shrink_axis_mask=2, _device="/device:GPU:1"](tower_1/mask_rcnn/Where_3, tower_1/mask_rcnn/strided_slice_13/stack, tower_1/mask_rcnn/strided_slice_13/stack_1, tower_1/mask_rcnn/strided_slice_13/stack_2)]]
Caused by op u'tower_1/mask_rcnn/strided_slice_13', defined at:
  File "coco_multi.py", line 247, in <module>
    model_dir=args.logs)
  File "/home/wenyao/project/Mask_RCNN/model.py", line 1736, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "/home/wenyao/project/Mask_RCNN/model.py", line 1947, in build
    model = ParallelModel(model, config.GPU_COUNT)
  File "/home/wenyao/project/Mask_RCNN/parallel_model.py", line 37, in __init__
    merged_outputs = self.make_parallel()
  File "/home/wenyao/project/Mask_RCNN/parallel_model.py", line 81, in make_parallel
    outputs = self.inner_model(inputs)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/keras/engine/topology.py", line 2058, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/keras/engine/topology.py", line 2248, in run_internal_graph
    shapes = _to_list(layer.compute_output_shape([x._keras_shape for x in computed_tensors]))
  File "/usr/local/lib/python27/lib/python2.7/site-packages/keras/layers/core.py", line 613, in compute_output_shape
    x = self.call(xs)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/keras/layers/core.py", line 650, in call
    return self.function(inputs, **arguments)
  File "/home/wenyao/project/Mask_RCNN/model.py", line 1901, in <lambda>
    mask_loss = KL.Lambda(lambda x: mrcnn_mask_loss_graph(*x), name="mrcnn_mask_loss")(
  File "/home/wenyao/project/Mask_RCNN/model.py", line 1118, in mrcnn_mask_loss_graph
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
  File "/usr/local/lib/python27/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 538, in _SliceHelper
    name=name)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 706, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5430, in strided_slice
    name=name)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'tower_1/mask_rcnn/strided_slice_13': Could not satisfy explicit device specification '/device:GPU:1' because no supported kernel for GPU devices is available.
         [[Node: tower_1/mask_rcnn/strided_slice_13 = StridedSlice[Index=DT_INT32, T=DT_INT64, begin_mask=1, ellipsis_mask=0, end_mask=1, new_axis_mask=0, shrink_axis_mask=2, _device="/device:GPU:1"](tower_1/mask_rcnn/Where_3, tower_1/mask_rcnn/strided_slice_13/stack, tower_1/mask_rcnn/strided_slice_13/stack_1, tower_1/mask_rcnn/strided_slice_13/stack_2)]]

This exception seems to be raised while the COCO weights are being loaded. However, the parallel_model.py file runs perfectly on its test code.

I searched a lot and couldn't solve this one. Has anyone had a similar problem?

I'm running Keras 2.0.8 and TensorFlow 1.4.0.

waleedka commented 6 years ago

This is not a solution, but may help you on your journey to find a solution. This is the error you're getting, and I think I've seen something similar before.

Cannot assign a device for operation 'tower_1/mask_rcnn/strided_slice_13': Could not satisfy explicit device specification '/device:GPU:1' because no supported kernel for GPU devices is available

I was trying to set up TensorFlow debugging and needed to change some settings in the TF session. By default, TF tries to put ops on the device you specify, but if it can't, it puts them on the CPU. The change I made (sorry, I don't remember it now) caused TF to strictly enforce device placement. Some ops don't have a GPU implementation, so they can't be placed on the GPU.

Either find out what you changed that caused TF to enforce device placement, or manually find all the ops that don't have a GPU implementation and wrap them with tf.device() to explicitly force them onto the CPU.

irnwritshin commented 6 years ago

@waleedka Thank you so much for the reply, it's actually very helpful!

So, I've located the problem: the strided_slice error appears when I manually configure TensorFlow to limit GPU memory usage. If I disable the config, there is no bug. The problem is half solved!
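The thread does not show the memory-limiting config itself, so the following is only a minimal sketch of one way to keep such a limit without enforcing strict placement. It assumes the limit was set through a custom `tf.ConfigProto` (TensorFlow 1.x) and turns on `allow_soft_placement` in the same config. The helper name `build_session_config` and the 0.5 default are hypothetical, and the TensorFlow module is passed in as an argument so the sketch does not depend on a particular install.

```python
def build_session_config(tf, memory_fraction=0.5):
    """Build a TF 1.x session config that caps GPU memory and allows soft placement.

    `tf` is the imported TensorFlow 1.x module (an assumption of this sketch).
    `memory_fraction` is an arbitrary illustrative value, not one taken
    from the thread.
    """
    # allow_soft_placement lets TF fall back to another device (for
    # example the CPU) when an op has no kernel on the requested GPU,
    # instead of raising "Cannot assign a device for operation ...".
    config = tf.ConfigProto(allow_soft_placement=True)
    # Cap the share of each visible GPU's memory this process may use.
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


# Intended use with TensorFlow 1.x and Keras (not executed here):
#   import tensorflow as tf
#   import keras.backend as K
#   K.set_session(tf.Session(config=build_session_config(tf)))
```

Whether this removes the error in this particular setup is untested; it only shows where the soft-placement flag would go alongside the memory limit.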

However, I never run into this problem when running on one GPU, which led me to check whether some incompatibility comes from the encapsulation in parallel_model.py. So far I haven't found anything.

Another weird thing that I can't get my head around is that the problem seems to occur only on the second device (GPU:1). Any thoughts?

waleedka commented 6 years ago

When you use one GPU, the parallel model is not used, which means tf.device() is never called, so op placement follows the default setting (put on the GPU if available, otherwise on the CPU). Once you use more than one GPU, the parallel model is used, and it tries to direct TF to put ops on specific GPUs.
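The difference between the two cases can be illustrated with a toy model. This is not TensorFlow's actual placer: the `place` function and the kernel table below are hypothetical, and only mimic the behavior described above (default placement, an explicit device request under strict placement, and an explicit request with a CPU fallback).

```python
# Hypothetical table of (op type, dtype) pairs that have a GPU kernel in
# this toy model. The traceback above reports that StridedSlice with
# T=DT_INT64 had no GPU kernel, so it is deliberately left out.
GPU_KERNELS = {("Conv2D", "float32")}


def place(op_type, dtype, requested=None, allow_soft_placement=False):
    """Return the device a toy placer would pick for one op."""
    has_gpu_kernel = (op_type, dtype) in GPU_KERNELS
    if requested is None:
        # No explicit device (single-GPU case): use a GPU when a kernel
        # exists, otherwise fall back to the CPU.
        return "/device:GPU:0" if has_gpu_kernel else "/device:CPU:0"
    if requested.startswith("/device:GPU") and not has_gpu_kernel:
        if allow_soft_placement:
            # Soft placement: quietly move the op to the CPU.
            return "/device:CPU:0"
        # Strict placement: the explicit request cannot be satisfied.
        raise ValueError(
            "Cannot assign a device for %s[%s]: no supported kernel for %s"
            % (op_type, dtype, requested)
        )
    return requested


# Single GPU, no explicit device request: the int64 slice lands on the CPU.
print(place("StridedSlice", "int64"))
# Multi-GPU tower pinned to GPU:1 with soft placement: falls back to the CPU.
print(place("StridedSlice", "int64", "/device:GPU:1", allow_soft_placement=True))
```

In this toy model, the same pinned request with `allow_soft_placement=False` raises an error, which mirrors the failure reported in the traceback.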

irnwritshin commented 6 years ago

@waleedka Thank you for the reply. I'll post the solution here if I find one.

matterport / Mask_RCNN

Problem training on multiple GPUs #106