muditchaudhary / RepPoints-x-Libra-R-CNN-x-Transformer-self-attention

Improving RepPoints object detection model using transformer attention and training balancing methods
MIT License

Minival Coco training error #3

Closed muditchaudhary closed 4 years ago

muditchaudhary commented 5 years ago

I have added the minival coco2014 dataset.

I am trying to train on valminusminival. I have modified my config file as follows:

dataset_type = 'CocoDataset'
data_root = 'data/coco2014/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
data = dict(
    imgs_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/minival/instances_valminusminival2014.json',
        img_prefix=data_root + 'val2014/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/minival/instances_minival2014.json',
        img_prefix=data_root + 'val2014/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/minival/instances_minival2014.json',
        img_prefix=data_root + 'val2014/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=False,
        test_mode=True))

The model trains for a few steps (Epoch[1] (4900/17593)) and then gives the following error:

RuntimeError: input image is smaller than kernel (shape_check at mmdet/ops/dcn/src/deform_conv_cuda.cpp:127)

I believe it has something to do with img_scale or size_divisor? How can I debug this?

Lanselott commented 5 years ago

data = dict(
    imgs_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=True),

You should download the train set from the COCO page. The minival split is for testing.

muditchaudhary commented 5 years ago

Noted. Thanks

Lanselott commented 5 years ago

trainval35k (115k images) is for training and minival (5k images) is for testing. Both splits have annotations.
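
A quick way to sanity-check which annotation file corresponds to which split is to load it with pycocotools and count the images. A minimal sketch (assuming pycocotools is installed and the coco2017-style layout used in the configs in this thread; train2017 should report roughly 118k images and val2017 about 5k):

from pycocotools.coco import COCO

data_root = 'data/coco/'
splits = [('train2017', 'annotations/instances_train2017.json'),
          ('val2017', 'annotations/instances_val2017.json')]
for name, ann_file in splits:
    coco = COCO(data_root + ann_file)  # loads and indexes the annotation file
    print(name, 'images:', len(coco.getImgIds()),
          'annotations:', len(coco.getAnnIds()))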

muditchaudhary commented 5 years ago

I tried the configuration you provided, only modifying the batch size, but it still gives me the same error after a few steps of training.

dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
data = dict(
    imgs_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=False,
        test_mode=True))

Lanselott commented 5 years ago

RuntimeError: input image is smaller than kernel (shape_check at mmdet/ops/dcn/src/deform_conv_cuda.cpp:127)

This error means that in one of the layers the feature map is smaller than the kernel size. It depends on the feature map, not the input image, so please check the code. Did you modify the original network structure?
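
For intuition: with the (1333, 800) keep-ratio resize, an image with an extreme aspect ratio ends up with a very short side, and at the coarsest FPN stride the corresponding feature map can become smaller than the 3x3 DCN kernel in the RepPoints head. The following is only a rough illustrative check, assuming the default point strides (8, 16, 32, 64, 128) and approximating each feature-map side as ceil(side / stride):

import math

def too_small_for_kernel(img_h, img_w, scale=(1333, 800), size_divisor=32,
                         strides=(8, 16, 32, 64, 128), kernel=3):
    # keep-ratio resize: long side capped at 1333, short side capped at 800
    ratio = min(max(scale) / max(img_h, img_w), min(scale) / min(img_h, img_w))
    h, w = round(img_h * ratio), round(img_w * ratio)
    # pad to a multiple of size_divisor, mirroring size_divisor=32 in the config
    h = math.ceil(h / size_divisor) * size_divisor
    w = math.ceil(w / size_divisor) * size_divisor
    # return the strides whose (approximate) feature map is smaller than the kernel
    return [s for s in strides
            if min(math.ceil(h / s), math.ceil(w / s)) < kernel]

# e.g. a hypothetical 51x640 image fails at the two coarsest levels
print(too_small_for_kernel(51, 640))  # -> [64, 128]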

muditchaudhary commented 5 years ago

I didn't modify the original network structure. I'll check it one more time.

But if it's a problem in a layer, why does the training run for about half an epoch?

muditchaudhary commented 5 years ago

The full error output looks like this:

Another person is getting the same error when testing RepPoints: https://github.com/microsoft/RepPoints/issues/13

gpu-slurm:/research/byu2/mudit7/FYP/RepPoints> bash ./mmdetection/tools/train_slurm_fyp.sh
+ PARTITION=gpu_8h
+ JOB_NAME=reppoints_moment_r50_fpn_2x_FYP
+ CONFIG=./configs/reppoints_moment_r50_fpn_1x.py
+ WORK_DIR=./work_dirs/reppoints_moment_r50_fpn_1x_test1
+ GPUS=2
+ GPUS_PER_NODE=2
+ CPUS_PER_TASK=5
+ SRUN_ARGS=
+ PY_ARGS=--validate
+ srun -p gpu_8h --job-name=reppoints_moment_r50_fpn_2x_FYP --gres=gpu:2 --ntasks=2 --ntasks-per-node=2 --cpus-per-task=5 --kill-on-bad-exit=1 python -u ./mmdetection/tools/train.py ./configs/reppoints_moment_r50_fpn_1x.py --work_dir=./work_dirs/reppoints_moment_r50_fpn_1x_test1 --launcher=slurm
srun: job 22168 queued and waiting for resources
srun: job 22168 has been allocated resources
2019-10-30 08:09:07,743 - INFO - Distributed training: True
2019-10-30 08:09:08,441 - INFO - load model from: modelzoo://resnet50
2019-10-30 08:09:15,247 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

loading annotations into memory...
loading annotations into memory...
Done (t=16.61s)
creating index...
Done (t=16.56s)
creating index...
index created!
index created!
2019-10-30 08:09:33,988 - INFO - Start running, host: mudit7@gpu37.cse.cuhk.edu.hk, work_dir: /research/byu2/mudit7/FYP/RepPoints/work_dirs/reppoints_moment_r50_fpn_1x_test1
2019-10-30 08:09:33,988 - INFO - workflow: [('train', 1)], max: 12 epochs
2019-10-30 08:10:29,381 - INFO - Epoch [1][100/58633]   lr: 0.00233, eta: 4 days, 12:14:23, time: 0.554, data_time: 0.225, memory: 2097, loss_cls: 1.2609, loss_pts_init: 0.6305, loss_pts_refine: 0.7354, loss: 2.6268
2019-10-30 08:11:02,426 - INFO - Epoch [1][200/58633]   lr: 0.00299, eta: 3 days, 14:23:28, time: 0.330, data_time: 0.006, memory: 2099, loss_cls: 1.0989, loss_pts_init: 0.4087, loss_pts_refine: 0.7389, loss: 2.2465
2019-10-30 08:11:34,777 - INFO - Epoch [1][300/58633]   lr: 0.00366, eta: 3 days, 6:39:14, time: 0.324, data_time: 0.004, memory: 2099, loss_cls: 1.0646, loss_pts_init: 0.4074, loss_pts_refine: 0.7712, loss: 2.2432
2019-10-30 08:12:08,559 - INFO - Epoch [1][400/58633]   lr: 0.00433, eta: 3 days, 3:28:42, time: 0.338, data_time: 0.011, memory: 2099, loss_cls: 1.0346, loss_pts_init: 0.4188, loss_pts_refine: 0.7522, loss: 2.2056
.
.
.
.
2019-10-30 09:42:40,370 - INFO - Epoch [1][15900/58633] lr: 0.00500, eta: 2 days, 19:06:52, time: 0.348, data_time: 0.005, memory: 2101, loss_cls: 0.5181, loss_pts_init: 0.1558, loss_pts_refine: 0.3435, loss: 1.0175
2019-10-30 09:43:16,372 - INFO - Epoch [1][16000/58633] lr: 0.00500, eta: 2 days, 19:06:54, time: 0.360, data_time: 0.016, memory: 2101, loss_cls: 0.5080, loss_pts_init: 0.1507, loss_pts_refine: 0.3290, loss: 0.9878
2019-10-30 09:43:51,618 - INFO - Epoch [1][16100/58633] lr: 0.00500, eta: 2 days, 19:06:24, time: 0.352, data_time: 0.005, memory: 2101, loss_cls: 0.5510, loss_pts_init: 0.1537, loss_pts_refine: 0.3269, loss: 1.0316
Traceback (most recent call last):
  File "./mmdetection/tools/train.py", line 108, in <module>
    main()
  File "./mmdetection/tools/train.py", line 104, in main
    logger=logger)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 58, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 186, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
    losses = model(**data)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
    losses = model(**data)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 50, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/detectors/single_stage.py", line 52, in forward_train
    outs = self.bbox_head(x)
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/anchor_heads/reppoints_head.py", line 291, in forward
    return multi_apply(self.forward_single, feats)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/utils/misc.py", line 24, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/anchor_heads/reppoints_head.py", line 280, in forward_single
    self.relu(self.reppoints_cls_conv(cls_feat, dcn_offset)))
  File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv.py", line 236, in forward
    self.dilation, self.groups, self.deformable_groups)
  File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv.py", line 55, in forward
    cur_im2col_step)
RuntimeError: input image is smaller than kernel (shape_check at mmdet/ops/dcn/src/deform_conv_cuda.cpp:127)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f24f23eddc5 in /research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: shape_check(at::Tensor, at::Tensor, at::Tensor*, at::Tensor, int, int, int, int, int, int, int, int, int, int) + 0x6dd (0x7f24a9e6d5bd in /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #2: deform_conv_forward_cuda(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, int, int, int, int, int, int, int, int, int, int, int) + 0xcf (0x7f24a9e6e28f in /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so)

muditchaudhary commented 4 years ago

The solution to this problem has been described in this issue: https://github.com/open-mmlab/mmdetection/issues/1453

Explanation at: https://github.com/open-mmlab/mmdetection/issues/1453#issuecomment-538602013
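
If the root cause is an image whose resized short side collapses below the kernel size at the coarsest FPN level (the failure mode sketched above), the offending images can be located ahead of time by scanning the annotation file, which would also explain why training runs for thousands of iterations before crashing. An illustrative scan, reusing the hypothetical too_small_for_kernel() helper from earlier and assuming pycocotools plus the coco2017 layout:

from pycocotools.coco import COCO

coco = COCO('data/coco/annotations/instances_train2017.json')
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]  # contains 'height', 'width', 'file_name'
    bad_strides = too_small_for_kernel(info['height'], info['width'])
    if bad_strides:
        print(info['file_name'], info['height'], info['width'],
              'too small at strides', bad_strides)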