Custom Training: Infer_shape error when training rcnn

ghost commented 7 years ago

I'm trying to train mx-rcnn on 4 GPUs using train_alternate.py. When it reaches the rcnn training step, it fails with the error message below. Could anyone help me with this? Thanks!

num_images 77
voc_2007_trainval gt roidb loaded from data/cache/voc_2007_trainval_gt_roidb.pkl
loading data/rpn_data/voc_2007_trainval_rpn.pkl
append flipped images to roidb
add bounding box regression targets
/home/mxnet_workspace/mx-rcnn/rcnn/processing/bbox_regression.py:94: RuntimeWarning: invalid value encountered in divide
  roidb[im_i]['bbox_targets'][cls_indexes, 1:] /= stds[cls, :]
infer_shape error. Arguments:
  rois: (4L, 64L, 5L)
  label: (4L, 64L)
  data: (4L, 3L, 600L, 674L)
  bbox_target: (4L, 64L, 72L)
  bbox_weight: (4L, 64L, 72L)
Traceback (most recent call last):
  File "train_alternate.py", line 95, in <module>
    main()
  File "train_alternate.py", line 92, in main
    alternate_train(args, ctx, args.pretrained, args.epoch, args.rpn_epoch, args.rcnn_epoch)
  File "train_alternate.py", line 31, in alternate_train
    train_rcnn(args, ctx, pretrained, epoch, 'model/rcnn1', begin_epoch, rcnn_epoch)
  File "/home/mxnet_workspace/mx-rcnn/rcnn/tools/train_rcnn.py", line 61, in train_rcnn
    arg_shape, out_shape, aux_shape = sym.infer_shape(*data_shape_dict)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.8.0-py2.7.egg/mxnet/symbol.py", line 459, in infer_shape
    return self._infer_shape_impl(False, args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.8.0-py2.7.egg/mxnet/symbol.py", line 526, in _infer_shape_impl
    ctypes.byref(complete)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.8.0-py2.7.egg/mxnet/base.py", line 77, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: InferShape Error in _minus2's rhs argument
Shape inconsistent, Provided=(1536,12), inferred shape=(256,12)

ijkguo commented 7 years ago

There is only one _minus operation in get_vgg_rcnn. What changes did you make?

ghost commented 7 years ago

@precedenceguo I only changed the class labels in pascal_voc.py to fit my dataset and set config.TEST.CXX_PROPOSAL = False, because 'Symbol' doesn't have the 'Proposal' attribute. Here is the get_vgg_rcnn code I use:

data = mx.symbol.Variable(name="data")
rois = mx.symbol.Variable(name='rois')
label = mx.symbol.Variable(name='label')
bbox_target = mx.symbol.Variable(name='bbox_target')
bbox_weight = mx.symbol.Variable(name='bbox_weight')

# reshape input
rois = mx.symbol.Reshape(data=rois, shape=(-1, 5), name='rois_reshape')
label = mx.symbol.Reshape(data=label, shape=(-1, ), name='label_reshape')
bbox_target = mx.symbol.Reshape(data=bbox_target, shape=(-1, 4 * num_classes), name='bbox_target_reshape')
bbox_weight = mx.symbol.Reshape(data=bbox_weight, shape=(-1, 4 * num_classes), name='bbox_weight_reshape')

# shared convolutional layers
relu5_3 = get_vgg_conv(data)

# Fast R-CNN
pool5 = mx.symbol.ROIPooling(
    name='roi_pool5', data=relu5_3, rois=rois, pooled_size=(7, 7), spatial_scale=1.0 / config.RCNN_FEAT_SRTIDE)
# group 6
flatten = mx.symbol.Flatten(data=pool5, name="flatten")
fc6 = mx.symbol.FullyConnected(data=flatten, num_hidden=4096, name="fc6")
relu6 = mx.symbol.Activation(data=fc6, act_type="relu", name="relu6")
drop6 = mx.symbol.Dropout(data=relu6, p=0.5, name="drop6")
# group 7
fc7 = mx.symbol.FullyConnected(data=drop6, num_hidden=4096, name="fc7")
relu7 = mx.symbol.Activation(data=fc7, act_type="relu", name="relu7")
drop7 = mx.symbol.Dropout(data=relu7, p=0.5, name="drop7")
# classification
cls_score = mx.symbol.FullyConnected(name='cls_score', data=drop7, num_hidden=num_classes)
cls_prob = mx.symbol.SoftmaxOutput(name='cls_prob', data=cls_score, label=label, normalization='batch')
# bounding box regression
bbox_pred = mx.symbol.FullyConnected(name='bbox_pred', data=drop7, num_hidden=num_classes * 4)
bbox_loss_ = bbox_weight * mx.symbol.smooth_l1(name='bbox_loss_', scalar=1.0, data=(bbox_pred - bbox_target))
bbox_loss = mx.sym.MakeLoss(name='bbox_loss', data=bbox_loss_, grad_scale=1.0 / config.TRAIN.BATCH_ROIS)

# reshape output
cls_prob = mx.symbol.Reshape(data=cls_prob, shape=(config.TRAIN.BATCH_IMAGES, -1, num_classes), name='cls_prob_reshape')
bbox_loss = mx.symbol.Reshape(data=bbox_loss, shape=(config.TRAIN.BATCH_IMAGES, -1, 4 * num_classes), name='bbox_loss_reshape')

# group output
group = mx.symbol.Group([cls_prob, bbox_loss])
return group

ijkguo commented 7 years ago

We can see that the only _minus is bbox_pred - bbox_target. Use sym.tojson() to check where the _minus2 is.
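A rough sketch of how the dumped JSON could be searched for that node. The JSON string here is a hypothetical, heavily trimmed stand-in for what sym.tojson() returns, and the layout (a top-level "nodes" list whose "inputs" entries start with the index of the source node) is an assumption about how this MXNet version serializes symbols. find_node is an illustrative helper, not part of MXNet.

```python
import json

# Hypothetical, heavily trimmed stand-in for the string returned by sym.tojson().
# Assumption: nodes live in a top-level "nodes" list and each entry of "inputs"
# starts with the index of the node that feeds it.
SYMBOL_JSON = """
{
  "nodes": [
    {"op": "FullyConnected", "name": "bbox_pred", "inputs": []},
    {"op": "null", "name": "bbox_target", "inputs": []},
    {"op": "Reshape", "name": "bbox_target_reshape", "inputs": [[1, 0]]},
    {"op": "_Minus", "name": "_minus2", "inputs": [[0, 0], [2, 0]]}
  ]
}
"""


def find_node(graph, name):
    """Return (index, op, input_names) for the node called `name`."""
    nodes = graph["nodes"]
    for index, node in enumerate(nodes):
        if node["name"] == name:
            # Translate each input's node index back into a readable name.
            input_names = [nodes[src[0]]["name"] for src in node["inputs"]]
            return index, node["op"], input_names
    raise KeyError("no node named %r" % name)


graph = json.loads(SYMBOL_JSON)
index, op, input_names = find_node(graph, "_minus2")
print(index, op, input_names)
```

With a real dump, json.loads(sym.tojson()) would replace the stand-in string. The first input name is the lhs of the subtraction and the second is the rhs.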

ghost commented 7 years ago

@precedenceguo Yes, it indeed has only one 'minus', but its name is _minus2. Part of the .json file is shown below:

{ "op": "null", "param": {}, "name": "bbox_target", "inputs": [], "backward_source_id": -1 },
{ "op": "Reshape", "param": { "keep_highest": "False", "reverse": "False", "shape": "(-1,84)", "target_shape": "(0,0)" }, "name": "bbox_target_reshape", "inputs": [[83, 0]], "backward_source_id": -1 },
{ "op": "_Minus", "param": {}, "name": "_minus2", "inputs": [[82, 0], [84, 0]], "backward_source_id": -1 },
{ "op": "smooth_l1", "param": {"scalar": "1"}, "name": "bbox_loss_", "inputs": [[85, 0]], "backward_source_id": -1 },
{ "op": "_Mul", "param": {}, "name": "_mul2", "inputs": [[79, 0], [86, 0]], "backward_source_id": -1 },
{ "op": "MakeLoss", "param": { "grad_scale": "0.0078125", "normalization": "null", "valid_thresh": "0" }, "name": "bbox_loss", "inputs": [[87, 0]], "backward_source_id": -1 },

ijkguo commented 7 years ago

OK, so mxnet.base.MXNetError: InferShape Error in _minus2's rhs argument Shape inconsistent, Provided=(1536,12), inferred shape=(256,12) means that rhs is shaped (1536,12) while lhs is shaped (256,12). lhs is bbox_pred and rhs is bbox_target.

Would you please check their shapes again? They come from the bbox_pred fully connected layer and the bbox_target data IO.
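One way to see where the two shapes could come from is to redo the reshape arithmetic with the argument shapes printed in the infer_shape log. This is only a sketch: num_classes = 3 is an assumption inferred from the 12 columns in the error (bbox_pred has num_hidden = 4 * num_classes), and reshape_with_wildcard is an illustrative helper, not an MXNet function.

```python
from functools import reduce
from operator import mul


def reshape_with_wildcard(shape, new_shape):
    """Resolve a single -1 in new_shape the way a Reshape to (-1, k) does."""
    total = reduce(mul, shape, 1)
    known = reduce(mul, (d for d in new_shape if d != -1), 1)
    if total % known != 0:
        raise ValueError("cannot reshape %r into %r" % (shape, new_shape))
    return tuple(total // known if d == -1 else d for d in new_shape)


# Shapes copied from the infer_shape log in this thread.
rois_shape = (4, 64, 5)
bbox_target_shape = (4, 64, 72)

# Assumption: the symbol was built with num_classes = 3, because the error
# reports 12 columns and bbox_pred has num_hidden = 4 * num_classes.
num_classes = 3

# bbox_pred has one row per ROI after rois is reshaped to (-1, 5).
num_rois = reshape_with_wildcard(rois_shape, (-1, 5))[0]
bbox_pred_shape = (num_rois, 4 * num_classes)

# bbox_target is reshaped to (-1, 4 * num_classes) inside the symbol.
bbox_target_reshaped = reshape_with_wildcard(
    bbox_target_shape, (-1, 4 * num_classes))

# Number of classes implied by the last axis of the loaded bbox_target.
classes_in_data = bbox_target_shape[-1] // 4

print(bbox_pred_shape, bbox_target_reshaped, classes_in_data)
```

Under that assumption the sketch gives (256, 12) for bbox_pred and (1536, 12) for the reshaped bbox_target, matching the two shapes in the error. It also suggests the loaded bbox_target was built for 18 classes (72 / 4), so the class count used for the symbol and the one used to build the roidb may disagree.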

ijkguo commented 7 years ago

Why not checkout coco as a different dataset?

ghost commented 7 years ago

@precedenceguo Thank you! I will checkout coco to give a try.

ijkguo / mx-rcnn

Custom Training: Infer_shape error when training rcnn #40