open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Please help me, I can't train the model on a custom dataset. #7488

Open · IvanMutus opened this issue 2 years ago

IvanMutus commented 2 years ago

Hello, I've tried changing almost all of the configuration parameters, but nothing helps: the training always stops at 200 iterations, and I've spent over 100 hours trying to figure out why. I have 1400 photos for training and 450 for validation. I adapted the structure of my custom dataset to match the KITTI tiny dataset, as shown in the mmdetection demo, so that I could train the model exactly the way the demo does. The model is SSD300.

/tmp/ipykernel_3063/3895221999.py:58: DeprecationWarning: np.long is a deprecated alias for np.compat.long. To silence this warning, use np.compat.long by itself. In the likely event your code does not need to work on Python 2 you can use the builtin int for which np.compat.long is itself an alias. Doing this will not modify any behaviour and is safe. When replacing np.long, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels=np.array(gt_labels, dtype=np.long),
/tmp/ipykernel_3063/3895221999.py:61: DeprecationWarning: np.long is a deprecated alias for np.compat.long. To silence this warning, use np.compat.long by itself. In the likely event your code does not need to work on Python 2 you can use the builtin int for which np.compat.long is itself an alias. Doing this will not modify any behaviour and is safe. When replacing np.long, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels_ignore=np.array(gt_labels_ignore, dtype=np.long))
/home/ivan/Рабочий стол/mmdetection/mmdet/datasets/custom.py:179: UserWarning: CustomDataset does not support filtering empty gt images.
  warnings.warn(
2022-03-21 21:46:05,961 - mmdet - INFO - load checkpoint from local path: checkpoints/test.pth
2022-03-21 21:46:07,965 - mmdet - WARNING - The model and loaded state dict do not match exactly
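Both DeprecationWarnings point at the dtype=np.long arguments in the custom dataset class quoted further down in this issue; per the warning text, the drop-in replacement on NumPy >= 1.20 is the builtin int or an explicit width such as np.int64. A sketch of just the two affected lines:

labels = np.array(gt_labels, dtype=np.int64)                # was dtype=np.long
labels_ignore = np.array(gt_labels_ignore, dtype=np.int64)  # was dtype=np.long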

size mismatch for bbox_head.cls_convs.0.0.weight: copying a param with shape torch.Size([324, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 512, 3, 3]).
size mismatch for bbox_head.cls_convs.0.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for bbox_head.cls_convs.1.0.weight: copying a param with shape torch.Size([486, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 1024, 3, 3]).
size mismatch for bbox_head.cls_convs.1.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bbox_head.cls_convs.2.0.weight: copying a param with shape torch.Size([486, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 512, 3, 3]).
size mismatch for bbox_head.cls_convs.2.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bbox_head.cls_convs.3.0.weight: copying a param with shape torch.Size([486, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.3.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bbox_head.cls_convs.4.0.weight: copying a param with shape torch.Size([324, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.4.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for bbox_head.cls_convs.5.0.weight: copying a param with shape torch.Size([324, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.5.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([64]).
2022-03-21 21:46:07,966 - mmdet - INFO - Start running, host: ivan@ivan-GL73-8RC, work_dir: /home/ivan/Рабочий стол/mmdetection/tutorial_exps
2022-03-21 21:46:07,966 - mmdet - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook


before_train_epoch: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook


before_train_iter: (VERY_HIGH ) StepLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook


after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) CheckInvalidLossHook


after_train_epoch: (NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook


before_val_epoch: (NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook


before_val_iter: (LOW ) IterTimerHook


after_val_iter: (LOW ) IterTimerHook


after_val_epoch: (VERY_LOW ) TextLoggerHook


after_run: (VERY_LOW ) TextLoggerHook


2022-03-21 21:46:07,967 - mmdet - INFO - workflow: [('train', 1)], max: 24 epochs
2022-03-21 21:46:07,967 - mmdet - INFO - Checkpoints will be saved to /home/ivan/Рабочий стол/mmdetection/tutorial_exps by HardDiskBackend.
/home/ivan/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2022-03-21 21:46:36,019 - mmdet - INFO - Epoch [1][10/881] lr: 2.500e-03, eta: 15:59:09, time: 2.723, data_time: 0.817, memory: 1782, loss_cls: 9.9198, loss_bbox: 5.8576, loss: 15.7773
2022-03-21 21:46:53,797 - mmdet - INFO - Epoch [1][20/881] lr: 2.500e-03, eta: 13:15:12, time: 1.794, data_time: 1.053, memory: 1782, loss_cls: 6.9702, loss_bbox: 7.9114, loss: 14.8815
2022-03-21 21:47:13,140 - mmdet - INFO - Epoch [1][30/881] lr: 2.500e-03, eta: 12:37:01, time: 1.936, data_time: 1.203, memory: 1782, loss_cls: 6.8268, loss_bbox: 7.8771, loss: 14.7039
2022-03-21 21:47:43,220 - mmdet - INFO - Epoch [1][40/881] lr: 2.500e-03, eta: 13:43:44, time: 2.914, data_time: 2.131, memory: 1782, loss_cls: 5.4283, loss_bbox: 7.3805, loss: 12.8088
2022-03-21 21:48:02,945 - mmdet - INFO - Epoch [1][50/881] lr: 2.500e-03, eta: 13:24:00, time: 2.067, data_time: 1.381, memory: 1782, loss_cls: 5.2074, loss_bbox: 7.2640, loss: 12.4714
2022-03-21 21:48:19,138 - mmdet - INFO - Epoch [1][60/881] lr: 2.500e-03, eta: 12:44:30, time: 1.619, data_time: 0.859, memory: 1782, loss_cls: 5.2744, loss_bbox: 7.4937, loss: 12.7681
2022-03-21 21:48:36,040 - mmdet - INFO - Epoch [1][70/881] lr: 2.500e-03, eta: 12:19:47, time: 1.690, data_time: 1.008, memory: 1782, loss_cls: 5.6761, loss_bbox: 7.8002, loss: 13.4763
2022-03-21 21:48:55,586 - mmdet - INFO - Epoch [1][80/881] lr: 2.500e-03, eta: 12:12:47, time: 1.955, data_time: 1.255, memory: 1782, loss_cls: 5.1228, loss_bbox: 8.7396, loss: 13.8623
2022-03-21 21:49:16,196 - mmdet - INFO - Epoch [1][90/881] lr: 2.500e-03, eta: 12:11:24, time: 2.060, data_time: 1.383, memory: 1782, loss_cls: 5.6127, loss_bbox: 6.9650, loss: 12.5777

2022-03-21 21:49:29,004 - mmdet - INFO - Epoch [1][100/881] lr: 2.500e-03, eta: 11:42:51, time: 1.281, data_time: 0.631, memory: 1782, loss_cls: 5.2070, loss_bbox: 6.3978, loss: 11.6048
2022-03-21 21:49:45,604 - mmdet - INFO - Epoch [1][110/881] lr: 2.500e-03, eta: 11:31:33, time: 1.660, data_time: 0.957, memory: 1782, loss_cls: 5.5467, loss_bbox: 6.4412, loss: 11.9879
2022-03-21 21:49:58,321 - mmdet - INFO - Epoch [1][120/881] lr: 2.500e-03, eta: 11:10:45, time: 1.272, data_time: 0.534, memory: 1782, loss_cls: 5.1328, loss_bbox: 5.7493, loss: 10.8820
2022-03-21 21:50:16,054 - mmdet - INFO - Epoch [1][130/881] lr: 2.500e-03, eta: 11:06:39, time: 1.774, data_time: 1.048, memory: 1782, loss_cls: 4.5745, loss_bbox: 5.6167, loss: 10.1911
2022-03-21 21:50:31,525 - mmdet - INFO - Epoch [1][140/881] lr: 2.500e-03, eta: 10:57:24, time: 1.547, data_time: 0.810, memory: 1782, loss_cls: 4.9017, loss_bbox: 6.4228, loss: 11.3245
2022-03-21 21:50:45,390 - mmdet - INFO - Epoch [1][150/881] lr: 2.500e-03, eta: 10:45:37, time: 1.386, data_time: 0.842, memory: 1782, loss_cls: 4.7573, loss_bbox: 5.2906, loss: 10.0479
2022-03-21 21:51:03,105 - mmdet - INFO - Epoch [1][160/881] lr: 2.500e-03, eta: 10:43:42, time: 1.771, data_time: 1.122, memory: 1782, loss_cls: 4.5290, loss_bbox: 5.5199, loss: 10.0489
2022-03-21 21:51:23,782 - mmdet - INFO - Epoch [1][170/881] lr: 2.500e-03, eta: 10:47:59, time: 2.064, data_time: 1.287, memory: 1782, loss_cls: 4.8839, loss_bbox: 5.0466, loss: 9.9305
2022-03-21 21:51:40,147 - mmdet - INFO - Epoch [1][180/881] lr: 2.500e-03, eta: 10:43:26, time: 1.636, data_time: 1.056, memory: 1782, loss_cls: 4.4631, loss_bbox: 5.1512, loss: 9.6143
2022-03-21 21:51:57,418 - mmdet - INFO - Epoch [1][190/881] lr: 2.500e-03, eta: 10:41:08, time: 1.733, data_time: 1.045, memory: 1782, loss_cls: 4.4632, loss_bbox: 5.4469, loss: 9.9100
2022-03-21 21:52:14,014 - mmdet - INFO - Epoch [1][200/881] lr: 2.500e-03, eta: 10:37:46, time: 1.661, data_time: 0.895, memory: 1782, loss_cls: nan, loss_bbox: nan, loss: nan
2022-03-21 21:52:14,738 - mmdet - INFO - loss become infinite or NaN!

AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_3063/79739429.py in <module>
     23 # Create work_dir
     24 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 25 train_detector(model, datasets, cfg, distributed=False, validate=True)
     26

~/Рабочий стол/mmdetection/mmdet/apis/train.py in train_detector(model, dataset, cfg, distributed, validate, timestamp, meta)
    206     elif cfg.load_from:
    207         runner.load_checkpoint(cfg.load_from)
--> 208     runner.run(data_loaders, cfg.workflow)

~/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py in run(self, data_loaders, workflow, max_epochs, **kwargs)
    125                 if mode == 'train' and self.epoch >= self._max_epochs:
    126                     break
--> 127                 epoch_runner(data_loaders[i], **kwargs)
    128
    129             time.sleep(1)  # wait for some hooks like loggers to finish

~/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py in train(self, data_loader, **kwargs)
     49             self.call_hook('before_train_iter')
     50             self.run_iter(data_batch, train_mode=True, **kwargs)
---> 51             self.call_hook('after_train_iter')
     52             self._iter += 1
     53

~/anaconda3/lib/python3.9/site-packages/mmcv/runner/base_runner.py in call_hook(self, fn_name)
    307         """
    308         for hook in self._hooks:
--> 309             getattr(hook, fn_name)(self)
    310
    311     def get_hook_info(self):

~/Рабочий стол/mmdetection/mmdet/core/hook/checkloss_hook.py in after_train_iter(self, runner)
     21     def after_train_iter(self, runner):
     22         if self.every_n_iters(runner, self.interval):
---> 23             assert torch.isfinite(runner.outputs['loss']), \
     24                 runner.logger.info('loss become infinite or NaN!')
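The assert above is CheckInvalidLossHook, configured with interval=50 in custom_hooks (see the config below), so the check only fires every 50th iteration: the run aborts at iteration 200 even though the loss may have first gone non-finite a few iterations earlier. One generic way to find the operation that produces the NaN, assuming training is launched from the same notebook cell shown in the traceback, is PyTorch's anomaly detection:

import torch

# Make autograd raise at the exact op that produces a NaN/Inf gradient.
# This slows training down considerably, so enable it only for debugging.
torch.autograd.set_detect_anomaly(True)

train_detector(model, datasets, cfg, distributed=False, validate=True)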

Config:

input_size = 300
model = dict(
    type='SingleStageDetector',
    backbone=dict(
        type='SSDVGG',
        depth=16,
        with_last_pool=False,
        ceil_mode=True,
        out_indices=(3, 4),
        out_feature_indices=(22, 34),
        init_cfg=dict(type='Pretrained', checkpoint='open-mmlab://vgg16_caffe')),
    neck=dict(
        type='SSDNeck',
        in_channels=(512, 1024),
        out_channels=(512, 1024, 512, 256, 256, 256),
        level_strides=(2, 2, 1, 1),
        level_paddings=(1, 1, 0, 0),
        l2_norm_scale=20),
    bbox_head=dict(
        type='SSDHead',
        in_channels=(512, 1024, 512, 256, 256, 256),
        num_classes=15,
        anchor_generator=dict(
            type='SSDAnchorGenerator',
            scale_major=False,
            input_size=300,
            basesize_ratio_range=(0.15, 0.9),
            strides=[8, 16, 32, 64, 100, 300],
            ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[0.1, 0.1, 0.2, 0.2])),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.0,
            ignore_iof_thr=-1,
            gt_max_assign_all=False),
        smoothl1_beta=1.0,
        allowed_border=-1,
        pos_weight=-1,
        neg_pos_ratio=3,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        nms=dict(type='nms', iou_threshold=0.45),
        min_bbox_size=0,
        score_thr=0.02,
        max_per_img=200))
cudnn_benchmark = True
dataset_type = 'DOTA'
data_root = 'AdaptedDataset/'
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Expand',
        mean=[123.675, 116.28, 103.53],
        to_rgb=True,
        ratio_range=(1, 4)),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.3),
    dict(type='Resize', img_scale=(300, 300), keep_ratio=False),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[1, 1, 1],
        to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(300, 300),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[1, 1, 1],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=3,
    train=dict(
        type='RepeatDataset',
        times=5,
        dataset=dict(
            type='DOTA',
            ann_file='train.txt',
            img_prefix='training/image_2',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(
                    type='Expand',
                    mean=[123.675, 116.28, 103.53],
                    to_rgb=True,
                    ratio_range=(1, 4)),
                dict(
                    type='MinIoURandomCrop',
                    min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
                    min_crop_size=0.3),
                dict(type='Resize', img_scale=(300, 300), keep_ratio=False),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='PhotoMetricDistortion',
                    brightness_delta=32,
                    contrast_range=(0.5, 1.5),
                    saturation_range=(0.5, 1.5),
                    hue_delta=18),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[1, 1, 1],
                    to_rgb=True),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ],
            data_root='AdaptedDataset/')),
    val=dict(
        type='DOTA',
        ann_file='val.txt',
        img_prefix='training/image_2',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(300, 300),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=False),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[1, 1, 1],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        data_root='AdaptedDataset/'),
    test=dict(
        type='DOTA',
        ann_file='train.txt',
        img_prefix='training/image_2',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(300, 300),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=False),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[1, 1, 1],
                        to_rgb=True),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        data_root='AdaptedDataset/'))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
    policy='step',
    warmup=None,
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[16, 22])
runner = dict(type='EpochBasedRunner', max_epochs=24)
checkpoint_config = dict(interval=1)
log_config = dict(interval=10, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [
    dict(type='NumClassCheckHook'),
    dict(type='CheckInvalidLossHook', interval=50, priority='VERY_LOW')
]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'checkpoints/test.pth'
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
work_dir = 'tutorial_exps'
seed = 0
gpu_ids = range(0, 1)
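A side note on the size mismatch warnings near the top of the log: in SSDHead, each per-level classification conv outputs num_anchors * (num_classes + 1) channels, so a checkpoint trained with a different number of classes cannot match this 15-class config. The numbers in the warnings line up exactly with an 80-class (COCO) checkpoint, assuming the standard SSD300 anchor counts of 4, 6, 6, 6, 4, 4 per level:

# cls conv out-channels per SSD feature level: anchors * (num_classes + 1)
for anchors in (4, 6, 6, 6, 4, 4):
    print(anchors * (80 + 1), anchors * (15 + 1))
# prints: 324 64 / 486 96 / 486 96 / 486 96 / 324 64 / 324 64

Those layers are freshly initialized instead of loaded, which is normal when fine-tuning on a new class set, so the warnings by themselves do not explain the NaN.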

The custom dataset class used to load the annotations:

import copy
import os.path as osp

import mmcv
import numpy as np

from mmdet.datasets.builder import DATASETS
from mmdet.datasets.custom import CustomDataset


@DATASETS.register_module()
class DOTA(CustomDataset):

    CLASSES = ('ship', 'small-vehicle', 'large-vehicle', 'plane', 'harbor',
               'storage-tank', 'tennis-court', 'bridge', 'swimming-pool',
               'helicopter', 'basketball-court', 'baseball-diamond',
               'roundabout', 'soccer-ball-field', 'ground-track-field')

    def load_annotations(self, ann_file):
        cat2label = {k: i for i, k in enumerate(self.CLASSES)}
        # load image list from file
        image_list = mmcv.list_from_file(self.ann_file)

        data_infos = []
        # convert annotations to middle format
        for image_id in image_list:
            filename = f'{self.img_prefix}/{image_id}.png'
            image = mmcv.imread(filename)
            height, width = image.shape[:2]

            data_info = dict(filename=f'{image_id}.png', width=width, height=height)

            # load annotations
            label_prefix = self.img_prefix.replace('image_2', 'label_2')
            lines = mmcv.list_from_file(osp.join(label_prefix, f'{image_id}.txt'))

            # DOTA line format: x1 y1 x2 y2 x3 y3 x4 y4 class-name difficulty
            content = [line.strip().split(' ') for line in lines]
            bbox_names = [x[8] for x in content]
            # Take the 1st and 3rd corners of the quad as an axis-aligned box;
            # this assumes the corners are ordered so that x1 < x3 and y1 < y3.
            bboxes = [[float(info) for info in (x[0:2] + x[4:6])] for x in content]

            gt_bboxes = []
            gt_labels = []
            gt_bboxes_ignore = []
            gt_labels_ignore = []

            # keep boxes whose class is in CLASSES, ignore the rest
            for bbox_name, bbox in zip(bbox_names, bboxes):
                if bbox_name in cat2label:
                    gt_labels.append(cat2label[bbox_name])
                    gt_bboxes.append(bbox)
                else:
                    gt_labels_ignore.append(-1)
                    gt_bboxes_ignore.append(bbox)

            data_anno = dict(
                bboxes=np.array(gt_bboxes, dtype=np.float32).reshape(-1, 4),
                # np.int64 instead of the deprecated np.long
                # (see the DeprecationWarnings in the log above)
                labels=np.array(gt_labels, dtype=np.int64),
                bboxes_ignore=np.array(gt_bboxes_ignore,
                                       dtype=np.float32).reshape(-1, 4),
                labels_ignore=np.array(gt_labels_ignore, dtype=np.int64))

            data_info.update(ann=data_anno)
            data_infos.append(data_info)

        return data_infos
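Given the corner-ordering assumption flagged in the comment above, it is worth verifying that every parsed box has positive width and height, since degenerate boxes are a common cause of NaN losses. A minimal sanity check, assuming the paths from the config:

dataset = DOTA(
    ann_file='train.txt',
    img_prefix='training/image_2',
    data_root='AdaptedDataset/',
    pipeline=[])
for info in dataset.data_infos:
    b = info['ann']['bboxes']
    # a valid box must satisfy x1 < x2 and y1 < y2
    bad = (b[:, 2] <= b[:, 0]) | (b[:, 3] <= b[:, 1])
    if bad.any():
        print(info['filename'], b[bad])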
Czm369 commented 2 years ago

You can check whether the ground truth is still normal after the data augmentation, and whether there is a division by zero during training.

IvanMutus commented 2 years ago

> You can check whether the ground truth is still normal after the data augmentation, and whether there is a division by zero during training.

How can I do it?
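For reference, one way to run that check in mmdet 2.x is to build the training dataset from the config and scan the ground truth after the full augmentation pipeline. A sketch, assuming cfg is the Config object used in the notebook:

from mmdet.datasets import build_dataset

ds = build_dataset(cfg.data.train)
for i in range(len(ds)):
    item = ds[i]  # indexing runs the full augmentation pipeline
    # DefaultFormatBundle wraps tensors in DataContainers, hence .data
    boxes = item['gt_bboxes'].data.numpy()
    ws = boxes[:, 2] - boxes[:, 0]
    hs = boxes[:, 3] - boxes[:, 1]
    if len(boxes) == 0 or (ws <= 0).any() or (hs <= 0).any():
        print('suspicious sample:', i, boxes)

mmdetection also ships tools/misc/browse_dataset.py, which draws the augmented images and boxes so they can be inspected visually.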