matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow
Other
24.65k stars 11.7k forks source link

An error raise when positive_overlaps has shape (0,0), how to fix? #170

Open ypflll opened 6 years ago

ypflll commented 6 years ago

Sometimes, a picture gives no positive_roi, and this will raise an error:

W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Reduction axis 1 is empty in shape [0,0] [[Node: proposal_targets/ArgMax = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](proposal_targets/Gather_5, rpn_class_loss/Equal/y)]] Traceback (most recent call last): File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call return fn(*args) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn status, run_metadata) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in exit c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.InvalidArgumentError: Reduction axis 1 is empty in shape [0,0] [[Node: proposal_targets/ArgMax = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](proposal_targets/Gather_5, rpn_class_loss/Equal/y)]] [[Node: roi_align_classifier/Cast_2/_7079 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_8551_roi_align_classifier/Cast_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "portrait_seg.py", line 206, in layers='4+') File "/home/xxx/Desktop/keras_Mask_RCNN/model.py", line 2211, in train use_multiprocessing=True, File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 87, in wrapper return func(*args, kwargs) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 2096, in fit_generator class_weight=class_weight) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 1814, in train_on_batch outputs = self.train_function(ins) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2352, in call self.session_kwargs) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run run_metadata_ptr) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run feed_dict_tensor, options, run_metadata) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run options, run_metadata) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: Reduction axis 1 is empty in shape [0,0] [[Node: proposal_targets/ArgMax = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](proposal_targets/Gather_5, rpn_class_loss/Equal/y)]] [[Node: roi_align_classifier/Cast_2/_7079 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_8551_roi_align_classifier/Cast_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'proposal_targets/ArgMax', defined at: File "portrait_seg.py", line 168, in model_dir=MODEL_DIR) File "/home/xxx/Desktop/keras_Mask_RCNN/model.py", line 1744, in init self.keras_model = self.build(mode=mode, config=config) File "/home/xxx/Desktop/keras_Mask_RCNN/model.py", line 1885, in build target_rois, input_gt_class_ids, gt_boxes, input_gt_masks]) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/engine/topology.py", line 603, in call output = self.call(inputs, kwargs) File "/home/xxx/Desktop/keras_Mask_RCNN/model.py", line 641, in call self.config.IMAGES_PER_GPU, names=names) File "/home/xxx/Desktop/keras_Mask_RCNN/utils.py", line 673, in batch_slice output_slice = graph_fn(inputs_slice) File "/home/xxx/Desktop/keras_Mask_RCNN/model.py", line 640, in w, x, y, z, self.config), File "/home/xxx/Desktop/keras_Mask_RCNN/model.py", line 544, in detection_targets_graph roi_gt_box_assignment = tf.argmax(positive_overlaps, axis=1) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 316, in new_func return func(args, kwargs) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 205, in argmax return gen_math_ops.arg_max(input, axis, name=name, output_type=output_type) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 441, in arg_max name=name) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op op_def=op_def) File "/home/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Reduction axis 1 is empty in shape [0,0] [[Node: proposal_targets/ArgMax = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"](proposal_targets/Gather_5, rpn_class_loss/Equal/y)]] [[Node: roi_align_classifier/Cast_2/_7079 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_8551_roi_align_classifier/Cast_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I tried to fix this by add a check if there is a positive_indices, in model.py, line528:

# Subsample ROIs. Aim for 33% positive
# Positive ROIs
positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
                     config.ROI_POSITIVE_RATIO)
positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
a = positive_indices.get_shape()[0]
if a == 0:
    positive_indices = tf.constant([1])
positive_count = tf.shape(positive_indices)[0]

However, it can avoid some wrong cases, but still raise error in other cases. How to fix this throghly?

horvitzs commented 6 years ago

I had same problem. When I tried following codes, if worked for me `

Assign positive ROIs to GT boxes.

positive_overlaps = tf.gather(overlaps, positive_indices)
roi_gt_box_assignment = tf.cond(tf.greater(tf.shape(positive_overlaps)[1], 0), true_fn = lambda: tf.argmax(positive_overlaps, axis=1), false_fn = lambda: tf.cast(tf.constant([]),tf.int64) )` (https://github.com/tensorflow/models/pull/1986/files)

ypflll commented 6 years ago

Thank for sharing. I tried your code, it worked, only partially: When set the tensor to size 0, it causes another problem, like this: https://github.com/tensorflow/tensorflow/issues/14962

Wondering what is your tf version? 1.5.0 has fixed this, but I am using 1.4.1.

horvitzs commented 6 years ago

That's strange. My tf version is also 1.4.1

ypflll commented 6 years ago

The error is: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 1 spatial: 28 28 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM Aborted (core dumped)

My cudn version: 8.0.44, cudnn version: 5.1.10.

@ppwwyyxx Have your PR: https://github.com/tensorflow/tensorflow/issues/14657 is inclued in tf version 1.4.1? I also find that after the error occured, my gpu(geforce gtx titanx, 12g) memory is always occupied with no process found. Maybe your PR did not solve the problem throughly. I'm not sure.

ppwwyyxx commented 6 years ago

No. The fix will probably in 1.6. You can use tf.cond to work around the bug like this: https://github.com/ppwwyyxx/tensorpack/blob/6bdd046057e507087f6da3af909d4bcf1726cff2/examples/FasterRCNN/train.py#L122-L133

ypflll commented 6 years ago

@ppwwyyxx Thanks for your code. I've thought it would be easy following your code, but get stuck still.

The primary code is: mrcnn_mask = build_fpn_mask_graph(rois, mrcnn_feature_maps, config.IMAGE_SHAPE, config.MASK_POOL_SIZE, config.NUM_CLASSES)

Like your code, I changed it to: def ff_true(): mrcnn_mask = build_fpn_mask_graph(rois, mrcnn_feature_maps, config.IMAGE_SHAPE, config.MASK_POOL_SIZE, config.NUM_CLASSES) return mrcnn_mask

def ff_false(): return target_mask

mrcnn_mask = tf.cond(tf.equal(tf.reduce_mean(rois), 0), ff_true, ff_true)

This raise an error:

ValueError: Initializer for variable cond/mrcnn_class_conv1/kernel/ is from inside a control-flow construct, such as a loop or conditional. When creating a variable inside a loop or conditional, use a lambda as the initializer.

Google it and seems that it's a matter a datatype, which tf also has a bug in error reporting, like this: https://github.com/tensorflow/tensorflow/issues/14729

So, I tried this: a = build_fpn_mask_graph(rois, mrcnn_feature_maps, config.IMAGE_SHAPE, config.MASK_POOL_SIZE, config.NUM_CLASSES)

def ff_true(): return a

def ff_false(): return target_mask

mrcnn_mask = tf.cond(tf.equal(tf.reduce_mean(rois), 0), ff_true, ff_true)

Also give me another error:

Traceback (most recent call last): File "coco.py", line 453, in model_dir=args.logs) File "/xxx/model.py", line 1775, in init self.keras_model = self.build(mode=mode, config=config) File "/xxx/model.py", line 2022, in build model = KM.Model(inputs, outputs, name='mask_rcnn') File "/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 87, in wrapper return func(*args, **kwargs) File "/xxx/anaconda2/envs/py36/lib/python3.6/site-packages/keras/engine/topology.py", line 1579, in init 'Keras tensors. Found: ' + str(x)) TypeError: Output tensors to a Model must be Keras tensors. Found: Tensor("cond/Merge:0", shape=(?, 200, 28, 28, 81), dtype=float32)

A little clueless to this. Any clue will be welcomed!

@Prausome You didn't meet this on tf version 1.4.1 seems strange to me. Do your runs coco.py with no modification? If you change the iou threshold bigger(like 2), maybe you can reproduce this: https://github.com/matterport/Mask_RCNN/blob/1c51787d8d8a5e4d08667178428cb99e31143713/model.py#L519

ppwwyyxx commented 6 years ago

TypeError: Output tensors to a Model must be Keras tensors. Found: Tensor("cond/Merge:0", shape=(?, 200, 28, 28, 81), dtype=float32)

I think the error is saying that Keras has a restrictions on the type of models you can use. But I don't know much about Keras to tell more.

This issue would only happen very occasionally in my experience. So I'm not surprised if someone doesn't see the same error.

ypflll commented 6 years ago

@ppwwyyxx Many thanks for your useful advice. I figure this out. It's exactly a difference between tf and keras tensor.

Keras tensors are theano/tf tensors with additional information included. You get keras tensors from keras.layers.Input or any time you pass an Input to a keras.layers.Layer. If you're just using the tensor in a loss calculation or something else, you don't have to wrap it in Lambdas. Refer to: https://github.com/keras-team/keras/issues/6263 This code works: mrcnn_mask = KL.Lambda(lambda x: tf.cond(tf.equal(tf.reduce_mean(x), 0),ff_true, ff_true)) (rois)

However, after fixing this, another error still exists:

ValueError: Initializer for variable cond/mrcnn_class_conv1/kernel/ is from inside a control-flow construct, such as a loop or conditional. When creating a variable inside a loop or conditional, use a lambda as the initializer.

I've raised this on stackoverflow: https://stackoverflow.com/questions/48515034/keras-tensorflow-initializer-for-variable-is-from-inside-a-control-flow-con Maybe I can get some clue to fix this bug finally.

ppwwyyxx commented 6 years ago

This is just another problem of Keras. Keras uses tensors to initialize variables, which is not legal inside conditional. tf.keras does not have this issue, btw.

waleedka commented 6 years ago

@ypflll Is issue solved for you?

I tried to reproduce the error you got, but I can't reproduce it at the moment. I forced positive_overlaps to be [], and the training is working as usual for me without any change to the code. I'm testing on TF 1.5 at the moment. Probably the latest version of TF solved it?

ypflll commented 6 years ago

Sorry for late reply, cause I'm on holiday.

I met the problem: 'Reduction axis 1 is empty in shape [0,0]' on the adobe portrait dataset, not on coco: http://xiaoyongshen.me/webpage_portrait/index.html On coco, I tried to set positive_overlaps to [] as you and no problem reaised.

Actually, I follow Prausome's code and solved this problem: # Assign positive ROIs to GT boxes. positive_overlaps = tf.gather(overlaps, positive_indices) roi_gt_box_assignment = tf.cond(tf.greater(tf.shape(positive_overlaps)[1], 0), true_fn = lambda: tf.argmax(positive_overlaps, axis=1), false_fn = lambda: tf.cast(tf.constant([]),tf.int64) )

Code for portrait segmentation is here: https://github.com/ypflll/portrait_seg_maskrcnn/blob/master/portrait.py Things not clear for me are what causes this problem and how to reproduce it.

The second problem I found as above is: F tensorflow/stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 1 spatial: 28 28 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM Aborted (core dumped)

I've thought it's caused by tf Conv2D backwards doesn't support zero batch size, like: https://github.com/tensorflow/tensorflow/issues/14657. After debugging by adding my code on your primary code, I find that this occurs when I add a new loss, code is here: https://github.com/ypflll/portrait_seg_maskrcnn/blob/master/model.py When the first problem occurs, some tensors are NULL, and this causes the second problem when running the code I add. This may beyond this issue. I will explore it later when I have time.

kstseng commented 6 years ago

follow the answer from @horvitzs , and it works! Thanks!

ziyigogogo commented 6 years ago

@waleedka i am using 1.7.0 version of tensorflow and met the same problem, so i do not think it's related to tensorflow version at the moment. I am now trying the code from @horvitzs, not sure if it works. But i will update.

By the way, I met this problem while trying to run this project: https://github.com/crowdAI/crowdai-mapping-challenge-mask-rcnn. The full datasets has this problem while the small subset dataset does not. I am trying to locate the particular image. If i successed, i will post that image here to see if you can reproduce the same problem.

waleedka commented 6 years ago

I merged a PR by @julienr that might help with this issue. It's similar to the fix suggested above by @horvitzs.

I did more testing on this case focusing on detection_targets_graph(). This function tries to match ROIs with ground truth boxes. I focused on these two edge cases:

  1. No positive ROIs
  2. No ground truth boxes

1. No positive ROIs

If I update the code to simulate not finding any positive ROIs, I get a crash and core dump on TF 1.8.

./tensorflow/core/util/cuda_launch_config.h:127] Check failed: work_element_count > 0 (0 vs. 0)
Aborted (core dumped)

I verified that detection_target_graph() is running all the way through without problems. So the error is happening at some point after that.

2. No ground truth boxes

When I update the code to simulate not having any ground truth boxes, I get the same error reported by @ypflll above. The recently merged PR fixes this issue. So now detection_target_graph() runs through correctly, and then I get a crash and core dump after that, just like the above.

So, the good news: we fixed the error reported above. The bad news: One of the TF operations are crashing when it receives a tensor with one of it's dimensions as 0.

fedebayle commented 6 years ago

By the way, I met this problem while trying to run this project: https://github.com/crowdAI/crowdai-mapping-challenge-mask-rcnn. The full datasets has this problem while the small subset dataset does not. I am trying to locate the particular image. If i successed, i will post that image here to see if you can reproduce the same problem.

@ziyigogogo Did you located the particular image? I'm experimenting the same problem with the full dataset.