matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Multi GPU error: InvalidArgumentError : ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0 #567

Open JingyunLiang opened 6 years ago

JingyunLiang commented 6 years ago

On Ubuntu 14.04 with Python 3.6.0, TensorFlow 1.4.0 and Keras 2.0.8, training on multiple GPUs fails with this error:

InvalidArgumentError : ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
A more detailed log follows:

```
mrcnn_mask_deconv (TimeDistributed)
mrcnn_class_logits (TimeDistributed)
mrcnn_mask (TimeDistributed)
/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/training.py:1987: UserWarning: Using a generator with `use_multiprocessing=True` and multiple workers may duplicate your data. Please consider using the`keras.utils.Sequence class.
  UserWarning('Using a generator with `use_multiprocessing=True`'
Epoch 1/30
2018-05-16 09:22:38.466424: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.466601: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.466769: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.466866: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.477960: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.554091: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.594089: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.596734: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.596861: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
2018-05-16 09:22:38.596941: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
Traceback (most recent call last):
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
     [[Node: training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad/_5189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_16970_training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ljy/Mask_RCNN/samples/balloon/balloon_nm.py", line 381, in <module>
    train(model)
  File "/home/ljy/Mask_RCNN/samples/balloon/balloon_nm.py", line 214, in train
    layers='heads')
  File "/home/ljy/Mask_RCNN/mrcnn/model_detection.py", line 2329, in train
    use_multiprocessing=True,
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2273, in __call__
    **self.session_kwargs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
     [[Node: training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad/_5189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_16970_training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'mrcnn_bbox_loss_1/concat', defined at:
  File "/home/ljy/Mask_RCNN/samples/balloon/balloon_nm.py", line 348, in <module>
    model_dir=args.logs)
  File "/home/ljy/Mask_RCNN/mrcnn/model_detection.py", line 1824, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "/home/ljy/Mask_RCNN/mrcnn/model_detection.py", line 2043, in build
    model = ParallelModel(model, config.GPU_COUNT)
  File "/home/ljy/Mask_RCNN/mrcnn/parallel_model.py", line 37, in __init__
    merged_outputs = self.make_parallel()
  File "/home/ljy/Mask_RCNN/mrcnn/parallel_model.py", line 102, in make_parallel
    m = KL.Concatenate(axis=0, name=name)(outputs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/engine/topology.py", line 602, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/layers/merge.py", line 332, in call
    return K.concatenate(inputs, axis=self.axis)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1709, in concatenate
    return tf.concat([to_dense(x) for x in tensors], axis)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1099, in concat
    return gen_array_ops._concat_v2(values=values, axis=axis, name=name)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 706, in _concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/ljy/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): ConcatOp : Expected concatenating dimensions in the range [0, 0), but got 0
     [[Node: mrcnn_bbox_loss_1/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](tower_0/mask_rcnn/mrcnn_bbox_loss/Mean/_4921, tower_1/mask_rcnn/mrcnn_bbox_loss/Mean/_4923, split_2/split_dim)]]
     [[Node: training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad/_5189 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_16970_training/SGD/gradients/tower_1/mask_rcnn/fpn_c4p4/BiasAdd_grad/BiasAddGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
```

shijungg commented 6 years ago

I got the same problem with 4 GPUs. I have not found a solution.

liangbo-1 commented 6 years ago

I don't see this problem. Can you tell me exactly how you set up multiple GPUs?

shijungg commented 6 years ago

@liangbo-1 I have a machine with 4 GPUs. I just changed the config parameter to GPU_COUNT = 4.
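
For reference, the traceback in the first post shows the model being wrapped as `ParallelModel(model, config.GPU_COUNT)`, so the GPU count is read from the config object. Below is a minimal sketch of that kind of setup. The `Config` class here is a simplified stand-in written for this example, not the repository's class, and `IMAGES_PER_GPU` and the `BATCH_SIZE` rule are assumptions about how batch size scales with the GPU count.

```python
class Config:
    """Simplified stand-in for a training config (hypothetical, for illustration)."""

    NAME = None
    GPU_COUNT = 1        # number of GPUs to train on
    IMAGES_PER_GPU = 2   # assumed per-GPU batch size

    def __init__(self):
        # Assumed rule: each GPU processes IMAGES_PER_GPU images per step.
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT


class FourGPUConfig(Config):
    """Override only the values that differ from the defaults."""

    NAME = "balloon"
    GPU_COUNT = 4


config = FourGPUConfig()
print(config.GPU_COUNT, config.BATCH_SIZE)
```

In a real project the subclass would presumably derive from the repository's own config class instead of this stand-in.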

nicolaihaeni commented 6 years ago

I am having the same problem on Ubuntu 16.04, Python 3.6, TensorFlow 1.4 and Keras 2.0.8.

taijizhao commented 6 years ago

I have the same problem on Ubuntu 16.04, Python 3.5, TF 1.4 and Keras 2.1.2. Exactly the same error.

nemonameless commented 6 years ago

@MichaelLiang12 @taijizhao @Nicolai-Haeni Has anyone solved it? Maybe it is caused by the TF version?

taijizhao commented 6 years ago

@MichaelLiang12 I upgraded to TF 1.8 and Keras 2.1.6 and the problem disappeared. At least, training the shapes sample on multiple GPUs works fine.
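
A quick way to compare an environment against the combination reported as working here (TF 1.8, Keras 2.1.6) is sketched below. The helper names are made up for this example, and the thread only gives data points: it does not establish which exact release changed the behaviour, so treat the threshold as a hint rather than a guarantee.

```python
import re


def version_tuple(version):
    """Parse the leading numeric components of a version string, e.g. '1.4.0' -> (1, 4, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Versions reported as working in this thread (not a verified minimum).
REPORTED_WORKING = {"tensorflow": "1.8", "keras": "2.1.6"}


def at_least_reported(installed):
    """Return True if every installed version is >= the reported working one."""
    return all(
        version_tuple(installed[name]) >= version_tuple(minimum)
        for name, minimum in REPORTED_WORKING.items()
    )


print(at_least_reported({"tensorflow": "1.4.0", "keras": "2.0.8"}))
print(at_least_reported({"tensorflow": "1.8.0", "keras": "2.1.6"}))
```

In practice the dictionary would be filled from `tensorflow.__version__` and `keras.__version__`.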

zhjpqq commented 6 years ago

I also have this problem: tf=1.3, keras=2.0.8, cuda=8.0, python=3.4.0.

How can I solve it?

taijizhao commented 6 years ago

@zhjpqq In my case, I upgraded TF and Keras to the newest versions and everything works fine now.

ababino commented 5 years ago

I had the same problem. I fixed it by replacing line 97 of parallel_model.py

if K.int_shape(outputs[0]) == ():

by

if K.int_shape(outputs[0]) == () or not K.int_shape(outputs[0]):

I use python=3.6.8, tensorflow-gpu=1.3.0, and keras=2.0.8.

I could not upgrade TensorFlow because I am on a cluster with NVIDIA driver version 375.26, and tensorflow>1.4 is not compatible with it. I also do not have root access to change the driver.

I hope this is useful.
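
To illustrate why the extra `not ...` check in the change above can matter, here is a pure-Python analogy. It assumes that the shape helper may report a scalar as `()`, `[]` or `None` depending on the library version, and that scalar per-GPU outputs (the traceback shows two `Mean` outputs being concatenated) should be averaged instead of concatenated. The function names are invented for this sketch and this is not the repository's code.

```python
def is_scalar_shape(shape):
    """True when a static shape describes a scalar or is unavailable.

    Assumption: a scalar may show up as (), [] or None, so every falsy
    value counts. Note that `not shape` already covers the `== ()` case.
    """
    return not shape


def merge_tower_outputs(outputs, shape):
    """Merge per-GPU outputs: average scalars, concatenate batched lists."""
    if is_scalar_shape(shape):
        # Scalars have no axis 0 to concatenate along, so average them.
        return sum(outputs) / len(outputs)
    # Batched outputs are joined along the batch axis.
    merged = []
    for out in outputs:
        merged.extend(out)
    return merged


print(merge_tower_outputs([0.5, 1.5], ()))            # scalar reported as ()
print(merge_tower_outputs([0.5, 1.5], None))          # scalar reported as None
print(merge_tower_outputs([[1, 2], [3, 4]], (2,)))    # batched outputs
```

With only an `== ()` comparison, the `None` case would fall through to the concatenation branch, which matches the kind of failure shown in the log.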