IvanMutus opened this issue 2 years ago
You can check whether the ground truth after data augmentation is still valid, and whether a division-by-zero error occurs during training.
> You can check whether the ground truth after data augmentation is still valid, and whether a division-by-zero error occurs during training.

How can I do that?
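One way to check the ground truth after augmentation is to rebuild the training dataset from the config and draw the post-pipeline boxes. A minimal sketch, assuming an mmdetection 2.x / mmcv 1.x environment like the one in this log; `my_ssd300_config.py` is a hypothetical path standing in for your actual config file:

```python
import numpy as np
import mmcv
from mmcv import Config
from mmdet.datasets import build_dataset

cfg = Config.fromfile('my_ssd300_config.py')  # hypothetical: path to your training config
dataset = build_dataset(cfg.data.train)

# Draw the post-augmentation ground truth for a few samples. These boxes are
# exactly what the loss sees, so empty, zero-area, or out-of-image boxes
# (a classic source of NaN / division-by-zero) become visible here.
for i in range(5):
    sample = dataset[i]
    img = sample['img'].data.numpy().transpose(1, 2, 0)  # CHW tensor -> HWC array
    img = mmcv.imdenormalize(
        np.ascontiguousarray(img),
        mean=np.array(cfg.img_norm_cfg['mean'], dtype=np.float32),
        std=np.array(cfg.img_norm_cfg['std'], dtype=np.float32),
        to_bgr=True)  # the pipeline normalized with to_rgb=True
    bboxes = sample['gt_bboxes'].data.numpy()
    labels = sample['gt_labels'].data.numpy()
    if len(bboxes):
        wh = bboxes[:, 2:4] - bboxes[:, :2]
        print(i, 'smallest box w/h after augmentation:', wh.min(axis=0))
    else:
        print(i, 'no gt boxes survived augmentation')
    mmcv.imshow_det_bboxes(
        img.astype(np.uint8), bboxes, labels,
        class_names=dataset.CLASSES, show=False,
        out_file=f'vis/sample_{i}.jpg')
```

If your mmdetection checkout ships `tools/misc/browse_dataset.py`, it does roughly the same thing from the command line. For the division-by-zero side, the printed minimum width/height flags degenerate boxes before they ever reach the loss.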
Hello, I have tried changing almost all of the configuration parameters, but nothing helps: training always stops at iteration 200, and I have spent over 100 hours trying to figure out why. I have 1400 photos for training and 450 for validation. I adapted the structure of my custom dataset to the KITTI-tiny layout, as shown in the mmdetection demo, so that I could train the model exactly as the demo does. Model: SSD300.
```
/tmp/ipykernel_3063/3895221999.py:58: DeprecationWarning: `np.long` is a deprecated alias for `np.compat.long`. To silence this warning, use `np.compat.long` by itself. In the likely event your code does not need to work on Python 2 you can use the builtin `int` for which `np.compat.long` is itself an alias. Doing this will not modify any behaviour and is safe. When replacing `np.long`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  labels=np.array(gt_labels, dtype=np.long),
/tmp/ipykernel_3063/3895221999.py:61: DeprecationWarning: `np.long` is a deprecated alias for `np.compat.long`. [warning text identical to the one above]
  labels_ignore=np.array(gt_labels_ignore, dtype=np.long))
/home/ivan/Рабочий стол/mmdetection/mmdet/datasets/custom.py:179: UserWarning: CustomDataset does not support filtering empty gt images.
  warnings.warn(
2022-03-21 21:46:05,961 - mmdet - INFO - load checkpoint from local path: checkpoints/test.pth
2022-03-21 21:46:07,965 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for bbox_head.cls_convs.0.0.weight: copying a param with shape torch.Size([324, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 512, 3, 3]).
size mismatch for bbox_head.cls_convs.0.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for bbox_head.cls_convs.1.0.weight: copying a param with shape torch.Size([486, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 1024, 3, 3]).
size mismatch for bbox_head.cls_convs.1.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bbox_head.cls_convs.2.0.weight: copying a param with shape torch.Size([486, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 512, 3, 3]).
size mismatch for bbox_head.cls_convs.2.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bbox_head.cls_convs.3.0.weight: copying a param with shape torch.Size([486, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.3.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bbox_head.cls_convs.4.0.weight: copying a param with shape torch.Size([324, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.4.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([64]).
size mismatch for bbox_head.cls_convs.5.0.weight: copying a param with shape torch.Size([324, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([64, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.5.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([64]).

2022-03-21 21:46:07,966 - mmdet - INFO - Start running, host: ivan@ivan-GL73-8RC, work_dir: /home/ivan/Рабочий стол/mmdetection/tutorial_exps
2022-03-21 21:46:07,966 - mmdet - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) StepLrUpdaterHook
(NORMAL      ) CheckpointHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
before_train_epoch:
(VERY_HIGH   ) StepLrUpdaterHook
(NORMAL      ) NumClassCheckHook
(LOW         ) IterTimerHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
before_train_iter:
(VERY_HIGH   ) StepLrUpdaterHook
(LOW         ) IterTimerHook
(LOW         ) EvalHook
after_train_iter:
(ABOVE_NORMAL) OptimizerHook
(NORMAL      ) CheckpointHook
(LOW         ) IterTimerHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
(VERY_LOW    ) CheckInvalidLossHook
after_train_epoch:
(NORMAL      ) CheckpointHook
(LOW         ) EvalHook
(VERY_LOW    ) TextLoggerHook
before_val_epoch:
(NORMAL      ) NumClassCheckHook
(LOW         ) IterTimerHook
(VERY_LOW    ) TextLoggerHook
before_val_iter:
(LOW         ) IterTimerHook
after_val_iter:
(LOW         ) IterTimerHook
after_val_epoch:
(VERY_LOW    ) TextLoggerHook
after_run:
(VERY_LOW    ) TextLoggerHook
2022-03-21 21:46:07,967 - mmdet - INFO - workflow: [('train', 1)], max: 24 epochs
2022-03-21 21:46:07,967 - mmdet - INFO - Checkpoints will be saved to /home/ivan/Рабочий стол/mmdetection/tutorial_exps by HardDiskBackend.
/home/ivan/anaconda3/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
2022-03-21 21:46:36,019 - mmdet - INFO - Epoch [1][10/881]  lr: 2.500e-03, eta: 15:59:09, time: 2.723, data_time: 0.817, memory: 1782, loss_cls: 9.9198, loss_bbox: 5.8576, loss: 15.7773
2022-03-21 21:46:53,797 - mmdet - INFO - Epoch [1][20/881]  lr: 2.500e-03, eta: 13:15:12, time: 1.794, data_time: 1.053, memory: 1782, loss_cls: 6.9702, loss_bbox: 7.9114, loss: 14.8815
2022-03-21 21:47:13,140 - mmdet - INFO - Epoch [1][30/881]  lr: 2.500e-03, eta: 12:37:01, time: 1.936, data_time: 1.203, memory: 1782, loss_cls: 6.8268, loss_bbox: 7.8771, loss: 14.7039
2022-03-21 21:47:43,220 - mmdet - INFO - Epoch [1][40/881]  lr: 2.500e-03, eta: 13:43:44, time: 2.914, data_time: 2.131, memory: 1782, loss_cls: 5.4283, loss_bbox: 7.3805, loss: 12.8088
2022-03-21 21:48:02,945 - mmdet - INFO - Epoch [1][50/881]  lr: 2.500e-03, eta: 13:24:00, time: 2.067, data_time: 1.381, memory: 1782, loss_cls: 5.2074, loss_bbox: 7.2640, loss: 12.4714
2022-03-21 21:48:19,138 - mmdet - INFO - Epoch [1][60/881]  lr: 2.500e-03, eta: 12:44:30, time: 1.619, data_time: 0.859, memory: 1782, loss_cls: 5.2744, loss_bbox: 7.4937, loss: 12.7681
2022-03-21 21:48:36,040 - mmdet - INFO - Epoch [1][70/881]  lr: 2.500e-03, eta: 12:19:47, time: 1.690, data_time: 1.008, memory: 1782, loss_cls: 5.6761, loss_bbox: 7.8002, loss: 13.4763
2022-03-21 21:48:55,586 - mmdet - INFO - Epoch [1][80/881]  lr: 2.500e-03, eta: 12:12:47, time: 1.955, data_time: 1.255, memory: 1782, loss_cls: 5.1228, loss_bbox: 8.7396, loss: 13.8623
2022-03-21 21:49:16,196 - mmdet - INFO - Epoch [1][90/881]  lr: 2.500e-03, eta: 12:11:24, time: 2.060, data_time: 1.383, memory: 1782, loss_cls: 5.6127, loss_bbox: 6.9650, loss: 12.5777
2022-03-21 21:49:29,004 - mmdet - INFO - Epoch [1][100/881] lr: 2.500e-03, eta: 11:42:51, time: 1.281, data_time: 0.631, memory: 1782, loss_cls: 5.2070, loss_bbox: 6.3978, loss: 11.6048
2022-03-21 21:49:45,604 - mmdet - INFO - Epoch [1][110/881] lr: 2.500e-03, eta: 11:31:33, time: 1.660, data_time: 0.957, memory: 1782, loss_cls: 5.5467, loss_bbox: 6.4412, loss: 11.9879
2022-03-21 21:49:58,321 - mmdet - INFO - Epoch [1][120/881] lr: 2.500e-03, eta: 11:10:45, time: 1.272, data_time: 0.534, memory: 1782, loss_cls: 5.1328, loss_bbox: 5.7493, loss: 10.8820
2022-03-21 21:50:16,054 - mmdet - INFO - Epoch [1][130/881] lr: 2.500e-03, eta: 11:06:39, time: 1.774, data_time: 1.048, memory: 1782, loss_cls: 4.5745, loss_bbox: 5.6167, loss: 10.1911
2022-03-21 21:50:31,525 - mmdet - INFO - Epoch [1][140/881] lr: 2.500e-03, eta: 10:57:24, time: 1.547, data_time: 0.810, memory: 1782, loss_cls: 4.9017, loss_bbox: 6.4228, loss: 11.3245
2022-03-21 21:50:45,390 - mmdet - INFO - Epoch [1][150/881] lr: 2.500e-03, eta: 10:45:37, time: 1.386, data_time: 0.842, memory: 1782, loss_cls: 4.7573, loss_bbox: 5.2906, loss: 10.0479
2022-03-21 21:51:03,105 - mmdet - INFO - Epoch [1][160/881] lr: 2.500e-03, eta: 10:43:42, time: 1.771, data_time: 1.122, memory: 1782, loss_cls: 4.5290, loss_bbox: 5.5199, loss: 10.0489
2022-03-21 21:51:23,782 - mmdet - INFO - Epoch [1][170/881] lr: 2.500e-03, eta: 10:47:59, time: 2.064, data_time: 1.287, memory: 1782, loss_cls: 4.8839, loss_bbox: 5.0466, loss: 9.9305
2022-03-21 21:51:40,147 - mmdet - INFO - Epoch [1][180/881] lr: 2.500e-03, eta: 10:43:26, time: 1.636, data_time: 1.056, memory: 1782, loss_cls: 4.4631, loss_bbox: 5.1512, loss: 9.6143
2022-03-21 21:51:57,418 - mmdet - INFO - Epoch [1][190/881] lr: 2.500e-03, eta: 10:41:08, time: 1.733, data_time: 1.045, memory: 1782, loss_cls: 4.4632, loss_bbox: 5.4469, loss: 9.9100
2022-03-21 21:52:14,014 - mmdet - INFO - Epoch [1][200/881] lr: 2.500e-03, eta: 10:37:46, time: 1.661, data_time: 0.895, memory: 1782, loss_cls: nan, loss_bbox: nan, loss: nan
2022-03-21 21:52:14,738 - mmdet - INFO - loss become infinite or NaN!
```
```
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_3063/79739429.py in <module>
     23 # Create work_dir
     24 mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
---> 25 train_detector(model, datasets, cfg, distributed=False, validate=True)
     26

~/Рабочий стол/mmdetection/mmdet/apis/train.py in train_detector(model, dataset, cfg, distributed, validate, timestamp, meta)
    206     elif cfg.load_from:
    207         runner.load_checkpoint(cfg.load_from)
--> 208     runner.run(data_loaders, cfg.workflow)

~/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py in run(self, data_loaders, workflow, max_epochs, **kwargs)
    125             if mode == 'train' and self.epoch >= self._max_epochs:
    126                 break
--> 127             epoch_runner(data_loaders[i], **kwargs)
    128
    129         time.sleep(1)  # wait for some hooks like loggers to finish

~/anaconda3/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py in train(self, data_loader, **kwargs)
     49             self.call_hook('before_train_iter')
     50             self.run_iter(data_batch, train_mode=True, **kwargs)
---> 51             self.call_hook('after_train_iter')
     52             self._iter += 1
     53

~/anaconda3/lib/python3.9/site-packages/mmcv/runner/base_runner.py in call_hook(self, fn_name)
    307         """
    308         for hook in self._hooks:
--> 309             getattr(hook, fn_name)(self)
    310
    311     def get_hook_info(self):

~/Рабочий стол/mmdetection/mmdet/core/hook/checkloss_hook.py in after_train_iter(self, runner)
     21     def after_train_iter(self, runner):
     22         if self.every_n_iters(runner, self.interval):
---> 23             assert torch.isfinite(runner.outputs['loss']), \
     24                 runner.logger.info('loss become infinite or NaN!')
```
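The assertion above is raised by the `CheckInvalidLossHook` that the config below registers with `interval=50`, so the loss can go non-finite up to 49 iterations before training actually aborts. A small sketch of one debugging aid (same hook, just a tighter interval, so the first bad batch is easier to localize):

```python
# Check the loss every iteration instead of every 50, so training stops at
# the first non-finite loss and the offending batch can be identified.
custom_hooks = [
    dict(type='NumClassCheckHook'),
    dict(type='CheckInvalidLossHook', interval=1, priority='VERY_LOW')
]
```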
Config:

```python
input_size = 300
model = dict(
    type='SingleStageDetector',
    backbone=dict(
        type='SSDVGG',
        depth=16,
        with_last_pool=False,
        ceil_mode=True,
        out_indices=(3, 4),
        out_feature_indices=(22, 34),
        init_cfg=dict(type='Pretrained', checkpoint='open-mmlab://vgg16_caffe')),
    neck=dict(
        type='SSDNeck',
        in_channels=(512, 1024),
        out_channels=(512, 1024, 512, 256, 256, 256),
        level_strides=(2, 2, 1, 1),
        level_paddings=(1, 1, 0, 0),
        l2_norm_scale=20),
    bbox_head=dict(
        type='SSDHead',
        in_channels=(512, 1024, 512, 256, 256, 256),
        num_classes=15,
        anchor_generator=dict(
            type='SSDAnchorGenerator',
            scale_major=False,
            input_size=300,
            basesize_ratio_range=(0.15, 0.9),
            strides=[8, 16, 32, 64, 100, 300],
            ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[0.1, 0.1, 0.2, 0.2])),
    train_cfg=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.0,
            ignore_iof_thr=-1,
            gt_max_assign_all=False),
        smoothl1_beta=1.0,
        allowed_border=-1,
        pos_weight=-1,
        neg_pos_ratio=3,
        debug=False),
    test_cfg=dict(
        nms_pre=1000,
        nms=dict(type='nms', iou_threshold=0.45),
        min_bbox_size=0,
        score_thr=0.02,
        max_per_img=200))
cudnn_benchmark = True
dataset_type = 'DOTA'
data_root = 'AdaptedDataset/'
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Expand',
        mean=[123.675, 116.28, 103.53],
        to_rgb=True,
        ratio_range=(1, 4)),
    dict(
        type='MinIoURandomCrop',
        min_ious=(0.1, 0.3, 0.5, 0.7, 0.9),
        min_crop_size=0.3),
    dict(type='Resize', img_scale=(300, 300), keep_ratio=False),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='PhotoMetricDistortion',
        brightness_delta=32,
        contrast_range=(0.5, 1.5),
        saturation_range=(0.5, 1.5),
        hue_delta=18),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[1, 1, 1],
        to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(300, 300),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=False),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[1, 1, 1],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=8,
    workers_per_gpu=3,
    train=dict(
        type='RepeatDataset',
        times=5,
        dataset=dict(
            type='DOTA',
            ann_file='train.txt',
            img_prefix='training/image_2',
            pipeline=train_pipeline,  # the dump expands this verbatim
            data_root='AdaptedDataset/')),
    val=dict(
        type='DOTA',
        ann_file='val.txt',
        img_prefix='training/image_2',
        pipeline=test_pipeline,  # the dump expands this verbatim
        data_root='AdaptedDataset/'),
    test=dict(
        type='DOTA',
        ann_file='train.txt',
        img_prefix='training/image_2',
        pipeline=test_pipeline,  # the dump expands this verbatim
        data_root='AdaptedDataset/'))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(
    policy='step',
    warmup=None,
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[16, 22])
runner = dict(type='EpochBasedRunner', max_epochs=24)
checkpoint_config = dict(interval=1)
log_config = dict(interval=10, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [
    dict(type='NumClassCheckHook'),
    dict(type='CheckInvalidLossHook', interval=50, priority='VERY_LOW')
]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'checkpoints/test.pth'
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
work_dir = 'tutorial_exps'
seed = 0
gpu_ids = range(0, 1)
```
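Two entries in this config are worth flagging for NaN debugging: `lr_config` disables warmup (`warmup=None`), and `optimizer_config = dict()` applies no gradient clipping, both of which commonly destabilize early SSD fine-tuning. A hedged sketch of the usual countermeasures (the numbers are conventional starting points, not a verified fix for this dataset):

```python
# Illustrative stabilization settings, not a guaranteed fix:
optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0005)  # lower LR
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))  # clip exploding gradients
lr_config = dict(
    policy='step',
    warmup='linear',   # re-enable warmup instead of warmup=None
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[16, 22])
```

If the NaN persists with settings like these, the remaining suspect is the ground truth itself, which the visualization sketch near the top of the thread is meant to catch.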
The custom dataset class that loads my annotations:
```python
import copy
import os.path as osp

import mmcv
import numpy as np

from mmdet.datasets.builder import DATASETS
from mmdet.datasets.custom import CustomDataset


@DATASETS.register_module()
class DOTA(CustomDataset):
    ...  # class body truncated/garbled in the original post
```
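The `np.long` deprecation warnings earlier in the log point into this class (the ipykernel cell, lines 58 and 61). A minimal sketch of the silencing fix, assuming the truncated body builds its label arrays the way the warnings show; `gt_labels` and `gt_labels_ignore` are names taken from the warning output, with placeholder values here:

```python
import numpy as np

gt_labels = [0, 3, 7]   # hypothetical parsed class indices
gt_labels_ignore = []   # hypothetical ignored labels

# np.long has been deprecated since NumPy 1.20 (per the warning above);
# an explicit integer dtype is the drop-in replacement.
labels = np.array(gt_labels, dtype=np.int64)                 # was dtype=np.long
labels_ignore = np.array(gt_labels_ignore, dtype=np.int64)   # was dtype=np.long
```

The warnings are unrelated to the NaN loss, but silencing them keeps the training log readable.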