Broadcast shape incompatible while training fpn

Karthik-Suresh93 commented 6 years ago

Below is the full traceback of the error I get. I am training fpn_dcn on my custom data in the pascal voc format. PLEASE help me with this error. Thank you in advance

$ python experiments/fpn/fpn_end2end_train_test.py --cfg experiments/fpn/cfgs/resnet_v1_101_voc_visdrone_trainval_fpn_dcn_end2end_ohem.yaml ('Called with argument:', Namespace(cfg='experiments/fpn/cfgs/resnet_v1_101_voc_visdrone_trainval_fpn_dcn_end2end_ohem.yaml', frequent=100)) {'CLASS_AGNOSTIC': False, 'MXNET_VERSION': 'mxnet', 'SCALES': [(800, 1280)], 'TEST': {'BATCH_IMAGES': 1, 'CXX_PROPOSAL': False, 'HAS_RPN': True, 'NMS': 0.3, 'PROPOSAL_MIN_SIZE': 0, 'PROPOSAL_NMS_THRESH': 0.7, 'PROPOSAL_POST_NMS_TOP_N': 2000, 'PROPOSAL_PRE_NMS_TOP_N': 20000, 'RPN_MIN_SIZE': 0, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SOFTNMS_THRESH': 0.6, 'USE_SOFTNMS': True, 'max_per_image': 100, 'test_epoch': 7}, 'TEST_SCALES': [[480, 800], [576, 900], [688, 1100], [800, 1200], [1200, 1600], [1400, 2000]], 'TRAIN': {'ALTERNATE': {'RCNN_BATCH_IMAGES': 0, 'RPN_BATCH_IMAGES': 0, 'rfcn1_epoch': 0, 'rfcn1_lr': 0, 'rfcn1_lr_step': '', 'rfcn2_epoch': 0, 'rfcn2_lr': 0, 'rfcn2_lr_step': '', 'rpn1_epoch': 0, 'rpn1_lr': 0, 'rpn1_lr_step': '', 'rpn2_epoch': 0, 'rpn2_lr': 0, 'rpn2_lr_step': '', 'rpn3_epoch': 0, 'rpn3_lr': 0, 'rpn3_lr_step': ''}, 'ASPECT_GROUPING': True, 'BATCH_IMAGES': 1, 'BATCH_ROIS': -1, 'BATCH_ROIS_OHEM': 512, 'BBOX_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZATION_PRECOMPUTED': True, 'BBOX_REGRESSION_THRESH': 0.5, 'BBOX_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_WEIGHTS': array([1., 1., 1., 1.]), 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'CXX_PROPOSAL': False, 'ENABLE_OHEM': True, 'END2END': True, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'FLIP': True, 'RESUME': False, 'RPN_BATCH_SIZE': 256, 'RPN_BBOX_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 0, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SHUFFLE': True, 'begin_epoch': 0, 'end_epoch': 7, 'lr': 0.01, 'lr_factor': 0.1, 
'lr_step': '4,6', 'model_prefix': 'fpn_coco', 'momentum': 0.9, 'warmup': True, 'warmup_lr': 0.001, 'warmup_step': 250, 'wd': 0.0001}, 'dataset': {'NUM_CLASSES': 13, 'dataset': 'PascalVOC', 'dataset_path': '/scratch/user/k21993/visdrone/VisDrone2018/Code/data/VOCdevkit2007', 'image_set': '2007_trainval', 'proposal': 'rpn', 'root_path': './data', 'test_image_set': '2007_test'}, 'default': {'frequent': 100, 'kvstore': 'device'}, 'gpus': '0,1', 'network': {'ANCHOR_RATIOS': [0.5, 1, 2], 'ANCHOR_SCALES': [8], 'FIXED_PARAMS': ['conv1', 'bn_conv1', 'res2', 'bn2', 'gamma', 'beta'], 'FIXED_PARAMS_SHARED': ['conv1', 'bn_conv1', 'res2', 'bn2', 'res3', 'bn3', 'res4', 'bn4', 'gamma', 'beta'], 'IMAGE_STRIDE': 32, 'NUM_ANCHORS': 3, 'PIXEL_MEANS': array([103.06, 115.9 , 123.15]), 'RCNN_FEAT_STRIDE': 16, 'RPN_FEAT_STRIDE': [4, 8, 16, 32, 64], 'pretrained': './model/pretrained_model/resnet_v1_101', 'pretrained_epoch': 0}, 'output_path': './output/fpn/coco', 'symbol': 'resnet_v1_101_fpn_dcn_rcnn'} num_images 7009 voc_2007_trainval gt roidb loaded from ./data/cache/voc_2007_trainval_gt_roidb.pkl append flipped images to roidb filtered 22 roidb entries: 14018 -> 13996 providing maximum shape [('data', (1, 3, 800, 1280)), ('gt_boxes', (1, 100, 5))] [('label', (1, 255780)), ('bbox_target', (1, 12, 85260)), ('bbox_weight', (1, 12, 85260))] {'bbox_target': (1L, 12L, 72471L), 'bbox_weight': (1L, 12L, 72471L), 'data': (1L, 3L, 800L, 1088L), 'gt_boxes': (1L, 33L, 5L), 'im_info': (1L, 3L), 'label': (1L, 217413L)} ('lr', 0.01, 'lr_epoch_diff', [4.0, 6.0], 'lr_iters', [27992, 41988]) experiments/fpn/../../fpn/../lib/bbox/bbox_transform.py:82: RuntimeWarning: invalid value encountered in log targets_dw = np.log(gt_widths / ex_widths) experiments/fpn/../../fpn/operator_py/fpn_roi_pooling.py:30: RuntimeWarning: invalid value encountered in sqrt feat_id = np.clip(np.floor(2 + np.log2(np.sqrt(w * h) / 224)), 0, len(self.feat_strides) - 1) Error in CustomOp.forward: Traceback (most recent call last): 
  File "/scratch/user/k21993/visdrone/Deformable-ConvNets/mxnet_python27/lib/python2.7/site-packages/mxnet/operator.py", line 987, in forward_entry
    aux=tensors[4])
  File "experiments/fpn/../../fpn/operator_py/fpn_roi_pooling.py", line 88, in forward
    self.assign(out_data[0], req[0], roi_pool)
  File "/scratch/user/k21993/visdrone/Deformable-ConvNets/mxnet_python27/lib/python2.7/site-packages/mxnet/operator.py", line 468, in assign
    dst[:] = src
  File "/scratch/user/k21993/visdrone/Deformable-ConvNets/mxnet_python27/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 444, in __setitem__
    self._set_nd_basic_indexing(key, value)
  File "/scratch/user/k21993/visdrone/Deformable-ConvNets/mxnet_python27/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 699, in _set_nd_basic_indexing
    value = value.broadcast_to(shape)
  File "/scratch/user/k21993/visdrone/Deformable-ConvNets/mxnet_python27/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1677, in broadcast_to
    raise ValueError(err_str)
ValueError: operands could not be broadcast together with remapped shapes[original->remapped]: (2070L, 256L, 7L, 7L) and requested shape (2072L, 256L, 7L, 7L)

chinakook commented 6 years ago

Your image must be divided by 32.

Karthik-Suresh93 commented 6 years ago

Do you mean for each pixel value we do: pixel_value = pixel_value/32 ?

By the way, this error only seems to be for FPN, I am able to run the code for RFCN. Please let me know why this is happening. Thank you very much for your help :)

chinakook commented 6 years ago

No, the image width and height must be divided by 32.

Karthik-Suresh93 commented 6 years ago

To make sure I understand you correctly, are you saying I downsample each image (width and height) by a factor of 32 before I feed it into training? Why is this so?

chinakook commented 6 years ago

The image width and height must be exactly divisible by 32. For example, 1024 is OK because 1024 mod 32 = 0, but 1023 cannot be used for training because 1023 mod 32 = 31.
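
A minimal sketch of that check, in case it helps. The helper names are made up for illustration and are not part of the Deformable-ConvNets code; the stride of 32 is the value mentioned above (it also appears as IMAGE_STRIDE in the posted config).

```python
def is_stride_aligned(size, stride=32):
    """Return True if `size` is an exact multiple of `stride`."""
    return size % stride == 0


def round_up_to_stride(size, stride=32):
    """Return the smallest multiple of `stride` that is >= `size`."""
    return ((size + stride - 1) // stride) * stride


# Hypothetical image sizes (width, height), only for illustration.
for width, height in [(1024, 1024), (1023, 600)]:
    ok = is_stride_aligned(width) and is_stride_aligned(height)
    padded = (round_up_to_stride(width), round_up_to_stride(height))
    print((width, height), "aligned" if ok else "not aligned", "-> pad to", padded)
```

Rounding up rather than down means an image would be padded instead of cropped, so no annotated object is lost.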

Karthik-Suresh93 commented 6 years ago

Hi, so should I make this change in the config file? If so, the given scales are already divisible by 32:

MXNET_VERSION: "mxnet"
output_path: "./output/fpn/coco"
symbol: resnet_v1_101_fpn_dcn_rcnn
gpus: '0,1'
CLASS_AGNOSTIC: false
SCALES:
- 800
- 1280
# TEST_SCALES: [[800, 1280]] # single scale testing
TEST_SCALES: [[480, 800], [576, 900], [688, 1100], [800, 1200], [1200, 1600], [1400, 2000]] # multi-scale testing

Here, 800/32 = 25 and 1280/32 = 40, so both scales in the config file are exactly divisible by 32.
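
As a sanity check on the shapes printed in the training log, here is a rough sketch that reproduces the anchor counts from the config values (RPN_FEAT_STRIDE [4, 8, 16, 32, 64], NUM_ANCHORS 3). It assumes each pyramid level has ceil(size / stride) positions per side; that rounding rule is my assumption, not something taken from the repository code, although it does match the numbers in the log. The function name is hypothetical.

```python
import math


def fpn_anchor_positions(height, width, strides=(4, 8, 16, 32, 64)):
    """Count feature-map positions summed over all pyramid levels.

    Assumes each level has ceil(height / s) x ceil(width / s) positions.
    """
    total = 0
    for s in strides:
        rows = int(math.ceil(height / float(s)))
        cols = int(math.ceil(width / float(s)))
        total += rows * cols
    return total


num_anchors = 3  # NUM_ANCHORS from the posted config

# Maximum shape from the log: data (1, 3, 800, 1280)
positions = fpn_anchor_positions(800, 1280)
print(positions, positions * num_anchors)  # 85260 255780

# Actual batch from the log: data (1, 3, 800, 1088)
positions = fpn_anchor_positions(800, 1088)
print(positions, positions * num_anchors)  # 72471 217413
```

These match the 'bbox_target' and 'label' shapes in the log. Note that 800 / 64 = 12.5 is not a whole number, so the coarsest level only lines up because of the rounding.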

chinakook commented 6 years ago

I changed all my images to 1280*800 and training succeeded.

Karthik-Suresh93 commented 6 years ago

How did you resize smaller images (I have 1000*600, 800*600, etc.) to 1280*800? Did you zero-pad the smaller images? And for the bigger images (1800*1600), should I crop them? Thank you.

chinakook commented 6 years ago

Big images are resized, small ones are padded. You can also randomly crop big images.
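
One possible way to plan this, as a sketch only. The function names are hypothetical, the 1280*800 canvas is the size mentioned above, and padding on the right and bottom is an assumption made so that box coordinates only need to be scaled, not shifted. It computes the numbers without touching any pixels; the actual resize and zero-padding would be done with whatever image library you use.

```python
def plan_resize_and_pad(width, height, target_w=1280, target_h=800):
    """Return (scale, (new_w, new_h), (pad_right, pad_bottom)).

    Images larger than the canvas are shrunk, keeping the aspect ratio.
    Smaller images are left at their size and padded up to the canvas.
    """
    # Never upsample: the scale is capped at 1.0.
    scale = min(1.0, target_w / float(width), target_h / float(height))
    new_w = int(round(width * scale))
    new_h = int(round(height * scale))
    return scale, (new_w, new_h), (target_w - new_w, target_h - new_h)


def scale_box(box, scale):
    """Scale an (xmin, ymin, xmax, ymax) box to match the resized image."""
    return tuple(coord * scale for coord in box)


print(plan_resize_and_pad(1000, 600))   # small image: only padded
print(plan_resize_and_pad(1800, 1600))  # big image: shrunk, then padded
```

The ground-truth boxes in the VOC annotations have to be scaled by the same factor, otherwise they no longer match the resized image.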

Karthik-Suresh93 commented 6 years ago

Thank you very much for your help. I will make the changes and get back to you

Karthik-Suresh93 commented 6 years ago

I tried your suggestion, but I get the same shape error at a different place:

Error in CustomOp.forward: Traceback (most recent call last):
  File "/home/k21993/visdrone_mxnet/local/lib/python2.7/site-packages/mxnet/operator.py", line 782, in forward_entry
    aux=tensors[4])
  File "experiments/fpn/../../fpn/operator_py/fpn_roi_pooling.py", line 88, in forward
    self.assign(out_data[0], req[0], roi_pool)
  File "/home/k21993/visdrone_mxnet/local/lib/python2.7/site-packages/mxnet/operator.py", line 455, in assign
    dst[:] = src
  File "/home/k21993/visdrone_mxnet/local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 405, in __setitem__
    value.copyto(self)
  File "/home/k21993/visdrone_mxnet/local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1635, in copyto
    return _internal._copyto(self, out=other)
  File "", line 25, in _copyto
  File "/home/k21993/visdrone_mxnet/local/lib/python2.7/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/k21993/visdrone_mxnet/local/lib/python2.7/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [06:48:52] src/operator/nn/./../tensor/../elemwise_op_common.h:122: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected (2070,256,7,7), got (2072,256,7,7)

Can you help me with this?

chinakook commented 6 years ago

Use a smaller dataset to make sure it is not a problem with the dataset. Also use the newest MXNet; install it with 'pip install --pre mxnet-cu80' (or mxnet-cu90 or mxnet-cu91).

Karthik-Suresh93 commented 6 years ago

I tried the fpn deformable conv net on the PASCAL VOC dataset and it still gives errors.

chinakook commented 6 years ago

I tried fpn with rcnn only.

Karthik-Suresh93 commented 6 years ago

Yes, fpn rcnn seems to work. I think it might be a bug in their code.

Karthik-Suresh93 commented 6 years ago

Hi @chinakook , I was finally able to get training to run on the images resized so that both height and width are divisible by 32. However, the same error has come up when I run fpn_test.py, even though the test images seem to be of shape 1024x1024 (both divisible by 32). Please help me regarding this

chinakook commented 6 years ago

Refer to my repo: https://github.com/chinakook/Deformable-ConvNets/tree/dev/fpn It's compatible with mx-rcnn and has deploy and demo scripts.

Karthik-Suresh93 commented 6 years ago

Thank you, I'll give it a try!

msracver / Deformable-ConvNets

Broadcast shape incompatible while training fpn #194
