Closed muditchaudhary closed 4 years ago
data = dict(
    imgs_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=True),
You should download the train set from the COCO page. The minival set is for testing.
Noted. Thanks
trainval35k (115k images) is for training and minival (5k images) is for testing. Both have annotations.
I tried the configuration you provided, modifying only the batch size, but it still gives me the same error after a few steps of training.
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
data = dict(
    imgs_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0.5,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=True),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        img_scale=(1333, 800),
        img_norm_cfg=img_norm_cfg,
        size_divisor=32,
        flip_ratio=0,
        with_mask=False,
        with_crowd=False,
        with_label=False,
        test_mode=True))
RuntimeError: input image is smaller than kernel (shape_check at mmdet/ops/dcn/src/deform_conv_cuda.cpp:127)
This error means that in one of the layers the feature map is smaller than the kernel. It depends on the feature map size, not on the input image itself, so please check the code. Did you modify the original network structure?
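To make that concrete, here is a rough sketch of how small the feature maps can get for a single image. It is not mmdetection code. It assumes a keep-ratio resize to img_scale, padding to a multiple of size_divisor, five feature levels with strides (8, 16, 32, 64, 128), a 3x3 deformable kernel, and that the check fails when a feature map side is smaller than the kernel. All function names are made up for illustration.

```python
import math


def rescaled_size(width, height, img_scale=(1333, 800)):
    """Keep-ratio resize: long edge <= max(img_scale), short edge <= min(img_scale)."""
    max_long, max_short = max(img_scale), min(img_scale)
    scale = min(max_long / max(width, height), max_short / min(width, height))
    return int(width * scale + 0.5), int(height * scale + 0.5)


def pad_to_divisor(size, divisor=32):
    """Round a side length up to the next multiple of `divisor`."""
    return math.ceil(size / divisor) * divisor


def feature_map_sizes(width, height, img_scale=(1333, 800), size_divisor=32,
                      strides=(8, 16, 32, 64, 128)):
    """Approximate (width, height) of the feature map at each level (assumed strides)."""
    new_w, new_h = rescaled_size(width, height, img_scale)
    pad_w = pad_to_divisor(new_w, size_divisor)
    pad_h = pad_to_divisor(new_h, size_divisor)
    return [(math.ceil(pad_w / s), math.ceil(pad_h / s)) for s in strides]


def too_small_levels(width, height, kernel_size=3, **kwargs):
    """Indices of levels whose feature map is smaller than the kernel on either side."""
    sizes = feature_map_sizes(width, height, **kwargs)
    return [i for i, (fw, fh) in enumerate(sizes)
            if fw < kernel_size or fh < kernel_size]


if __name__ == '__main__':
    print(too_small_levels(640, 480))   # an ordinary 4:3 image
    print(too_small_levels(640, 100))   # a very wide, short image
```

Under these assumptions an ordinary 640x480 image is fine at every level, while a very wide 640x100 image ends up only 2 cells high at the coarsest level. That would also be consistent with training running for a long time before it crashes, since only a few images in the dataset have such an extreme aspect ratio.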
I didn't modify the original network structure. I'll check it one more time.
But if it's a problem in a layer, why does the training run for about half an epoch?
Another person is getting the same error when testing RepPoints: https://github.com/microsoft/RepPoints/issues/13
The error looks something like this:
gpu-slurm:/research/byu2/mudit7/FYP/RepPoints> bash ./mmdetection/tools/train_slurm_fyp.sh
+ PARTITION=gpu_8h
+ JOB_NAME=reppoints_moment_r50_fpn_2x_FYP
+ CONFIG=./configs/reppoints_moment_r50_fpn_1x.py
+ WORK_DIR=./work_dirs/reppoints_moment_r50_fpn_1x_test1
+ GPUS=2
+ GPUS_PER_NODE=2
+ CPUS_PER_TASK=5
+ SRUN_ARGS=
+ PY_ARGS=--validate
+ srun -p gpu_8h --job-name=reppoints_moment_r50_fpn_2x_FYP --gres=gpu:2 --ntasks=2 --ntasks-per-node=2 --cpus-per-task=5 --kill-on-bad-exit=1 python -u ./mmdetection/tools/train.py ./configs/reppoints_moment_r50_fpn_1x.py --work_dir=./work_dirs/reppoints_moment_r50_fpn_1x_test1 --launcher=slurm
srun: job 22168 queued and waiting for resources
srun: job 22168 has been allocated resources
2019-10-30 08:09:07,743 - INFO - Distributed training: True
2019-10-30 08:09:08,441 - INFO - load model from: modelzoo://resnet50
2019-10-30 08:09:15,247 - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc.weight, fc.bias
loading annotations into memory...
loading annotations into memory...
Done (t=16.61s)
creating index...
Done (t=16.56s)
creating index...
index created!
index created!
2019-10-30 08:09:33,988 - INFO - Start running, host: mudit7@gpu37.cse.cuhk.edu.hk, work_dir: /research/byu2/mudit7/FYP/RepPoints/work_dirs/reppoints_moment_r50_fpn_1x_test1
2019-10-30 08:09:33,988 - INFO - workflow: [('train', 1)], max: 12 epochs
2019-10-30 08:10:29,381 - INFO - Epoch [1][100/58633] lr: 0.00233, eta: 4 days, 12:14:23, time: 0.554, data_time: 0.225, memory: 2097, loss_cls: 1.2609, loss_pts_init: 0.6305, loss_pts_refine: 0.7354, loss: 2.6268
2019-10-30 08:11:02,426 - INFO - Epoch [1][200/58633] lr: 0.00299, eta: 3 days, 14:23:28, time: 0.330, data_time: 0.006, memory: 2099, loss_cls: 1.0989, loss_pts_init: 0.4087, loss_pts_refine: 0.7389, loss: 2.2465
2019-10-30 08:11:34,777 - INFO - Epoch [1][300/58633] lr: 0.00366, eta: 3 days, 6:39:14, time: 0.324, data_time: 0.004, memory: 2099, loss_cls: 1.0646, loss_pts_init: 0.4074, loss_pts_refine: 0.7712, loss: 2.2432
2019-10-30 08:12:08,559 - INFO - Epoch [1][400/58633] lr: 0.00433, eta: 3 days, 3:28:42, time: 0.338, data_time: 0.011, memory: 2099, loss_cls: 1.0346, loss_pts_init: 0.4188, loss_pts_refine: 0.7522, loss: 2.2056
.
.
.
.
2019-10-30 09:42:40,370 - INFO - Epoch [1][15900/58633] lr: 0.00500, eta: 2 days, 19:06:52, time: 0.348, data_time: 0.005, memory: 2101, loss_cls: 0.5181, loss_pts_init: 0.1558, loss_pts_refine: 0.3435, loss: 1.0175
2019-10-30 09:43:16,372 - INFO - Epoch [1][16000/58633] lr: 0.00500, eta: 2 days, 19:06:54, time: 0.360, data_time: 0.016, memory: 2101, loss_cls: 0.5080, loss_pts_init: 0.1507, loss_pts_refine: 0.3290, loss: 0.9878
2019-10-30 09:43:51,618 - INFO - Epoch [1][16100/58633] lr: 0.00500, eta: 2 days, 19:06:24, time: 0.352, data_time: 0.005, memory: 2101, loss_cls: 0.5510, loss_pts_init: 0.1537, loss_pts_refine: 0.3269, loss: 1.0316
Traceback (most recent call last):
File "./mmdetection/tools/train.py", line 108, in <module>
main()
File "./mmdetection/tools/train.py", line 104, in main
logger=logger)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 58, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 186, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/mmcv/runner/runner.py", line 358, in run
epoch_runner(data_loaders[i], **kwargs)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/mmcv/runner/runner.py", line 264, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/detectors/single_stage.py", line 52, in forward_train
outs = self.bbox_head(x)
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/anchor_heads/reppoints_head.py", line 291, in forward
return multi_apply(self.forward_single, feats)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/core/utils/misc.py", line 24, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/models/anchor_heads/reppoints_head.py", line 280, in forward_single
self.relu(self.reppoints_cls_conv(cls_feat, dcn_offset)))
File "/research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv.py", line 236, in forward
self.dilation, self.groups, self.deformable_groups)
File "/research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv.py", line 55, in forward
cur_im2col_step)
RuntimeError: input image is smaller than kernel (shape_check at mmdet/ops/dcn/src/deform_conv_cuda.cpp:127)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f24f23eddc5 in /research/byu2/mudit7/anaconda3/envs/pytorch12/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: shape_check(at::Tensor, at::Tensor, at::Tensor*, at::Tensor, int, int, int, int, int, int, int, int, int, int) + 0x6dd (0x7f24a9e6d5bd in /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #2: deform_conv_forward_cuda(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor, int, int, int, int, int, int, int, int, int, int, int) + 0xcf (0x7f24a9e6e28f in /research/byu2/mudit7/FYP/RepPoints/mmdetection/mmdet/ops/dcn/deform_conv_cuda.cpython-37m-x86_64-linux-gnu.so)
The solution to this problem has been described in this issue: https://github.com/open-mmlab/mmdetection/issues/1453
Explanation at: https://github.com/open-mmlab/mmdetection/issues/1453#issuecomment-538602013
I have added the minival coco2014 dataset.
I am trying to train on valminusminival. I have modified my config file as follows:
The model trains for a few steps (Epoch[1] (4900/17593)) and then gives the following error:
I believe it has something to do with img_scale or size_divisor? How can I debug this?
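One way to debug this is to scan the annotation file for images whose aspect ratio is extreme enough to leave a tiny feature map at the coarsest level. The sketch below is a guess at such a check, not something taken from mmdetection. It assumes a keep-ratio resize to img_scale, padding to size_divisor, a coarsest stride of 128 and a 3x3 kernel. The helper names are hypothetical; only the `images`, `width`, `height` and `file_name` fields come from the COCO annotation format.

```python
import json
import math


def top_level_short_side(width, height, img_scale=(1333, 800),
                         size_divisor=32, top_stride=128):
    """Short side of the coarsest feature map after resize and padding (assumed pipeline)."""
    max_long, max_short = max(img_scale), min(img_scale)
    scale = min(max_long / max(width, height), max_short / min(width, height))
    short = int(min(width, height) * scale + 0.5)
    padded = math.ceil(short / size_divisor) * size_divisor
    return math.ceil(padded / top_stride)


def find_risky_images(images, kernel_size=3, **kwargs):
    """File names of images whose coarsest feature map is smaller than the kernel."""
    return [img['file_name'] for img in images
            if top_level_short_side(img['width'], img['height'], **kwargs) < kernel_size]


def load_coco_images(ann_file):
    """Read the `images` list from a COCO-style annotation JSON file."""
    with open(ann_file) as f:
        return json.load(f)['images']
```

For example, `find_risky_images(load_coco_images(data_root + 'annotations/instances_train2017.json'))` would list the candidates in the train annotation file. If the list is not empty, those images are worth inspecting first, and changing img_scale or filtering them out are two things to try.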