open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

loss become infinite or NaN Training SSD300 on Customize Dataset #9363

Closed evaniajoycelin closed 1 year ago

evaniajoycelin commented 1 year ago

Error when training SSD300 with a custom dataset. Am I missing something?

/usr/local/lib/python3.7/dist-packages/mmcv/__init__.py:21: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  'On January 1, 2023, MMCV will release v2.0.0, in which it will remove '
/content/gdrive/MyDrive/OpenMMLab/mmdetection/mmdet/utils/setup_env.py:39: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting OMP_NUM_THREADS environment variable for each process '
/content/gdrive/MyDrive/OpenMMLab/mmdetection/mmdet/utils/setup_env.py:49: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  f'Setting MKL_NUM_THREADS environment variable for each process '
2022-11-21 18:01:30,785 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.7.15 (default, Oct 12 2022, 19:14:55) [GCC 7.5.0]
CUDA available: True
GPU 0: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.3+e71b499

2022-11-21 18:01:31,612 - mmdet - INFO - Distributed training: False
2022-11-21 18:01:32,382 - mmdet - INFO - Config:
input_size = 300
model = dict( type='SingleStageDetector', backbone=dict( type='SSDVGG', depth=16, with_last_pool=False, ceil_mode=True, out_indices=(3, 4), out_feature_indices=(22, 34), init_cfg=dict( type='Pretrained', checkpoint='open-mmlab://vgg16_caffe')), neck=dict( type='SSDNeck', in_channels=(512, 1024), out_channels=(512, 1024, 512, 256, 256, 256), level_strides=(2, 2, 1, 1), level_paddings=(1, 1, 0, 0), l2_norm_scale=20), bbox_head=dict( type='SSDHead', in_channels=(512, 1024, 512, 256, 256, 256), num_classes=4, anchor_generator=dict( type='SSDAnchorGenerator', scale_major=False, input_size=300, basesize_ratio_range=(0.15, 0.9), strides=[8, 16, 32, 64, 100, 300], ratios=[[2], [2, 3], [2, 3], [2, 3], [2], [2]]), bbox_coder=dict( type='DeltaXYWHBBoxCoder', target_means=[0.0, 0.0, 0.0, 0.0], target_stds=[0.1, 0.1, 0.2, 0.2])), train_cfg=dict( assigner=dict( type='MaxIoUAssigner', pos_iou_thr=0.5, neg_iou_thr=0.5, min_pos_iou=0.0, ignore_iof_thr=-1, gt_max_assign_all=False), smoothl1_beta=1.0, allowed_border=-1, pos_weight=-1, neg_pos_ratio=3, debug=False), test_cfg=dict( nms_pre=1000, nms=dict(type='nms', iou_threshold=0.45), min_bbox_size=0, score_thr=0.02, max_per_img=200))
cudnn_benchmark = True
dataset_type = 'MyDataset'
data_root = 'Fruitsv2-5'
img_norm_cfg = dict(mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True)
train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Expand', mean=[123.675, 116.28, 103.53], to_rgb=True, ratio_range=(1, 4)), dict( type='MinIoURandomCrop', min_ious=(0.1, 0.3, 0.5, 0.7, 0.9), min_crop_size=0.3), dict(type='Resize', img_scale=(300, 300), keep_ratio=False), dict(type='RandomFlip', flip_ratio=0.5), dict( type='PhotoMetricDistortion', brightness_delta=32, contrast_range=(0.5, 1.5), saturation_range=(0.5, 1.5), hue_delta=18), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ]
test_pipeline = [ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(300, 300), flip=False, transforms=[ dict(type='Resize', keep_ratio=False), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]
data = dict( samples_per_gpu=8, workers_per_gpu=3, train=dict( type='RepeatDataset', times=5, dataset=dict( type='MyDataset', ann_file='Fruitsv2-5/train/labels/', img_prefix='Fruitsv2-5/train/images/', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations', with_bbox=True), dict( type='Expand', mean=[123.675, 116.28, 103.53], to_rgb=True, ratio_range=(1, 4)), dict( type='MinIoURandomCrop', min_ious=(0.1, 0.3, 0.5, 0.7, 0.9), min_crop_size=0.3), dict(type='Resize', img_scale=(300, 300), keep_ratio=False), dict(type='RandomFlip', flip_ratio=0.5), dict( type='PhotoMetricDistortion', brightness_delta=32, contrast_range=(0.5, 1.5), saturation_range=(0.5, 1.5), hue_delta=18), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True), dict(type='DefaultFormatBundle'), dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']) ])), val=dict( type='MyDataset', ann_file='Fruitsv2-5/valid/labels/', img_prefix='Fruitsv2-5/valid/images/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(300, 300), flip=False, transforms=[ dict(type='Resize', keep_ratio=False), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]), test=dict( type='MyDataset', ann_file='Fruitsv2-5/test/labels/', img_prefix='Fruitsv2-5/test/images/', pipeline=[ dict(type='LoadImageFromFile'), dict( type='MultiScaleFlipAug', img_scale=(300, 300), flip=False, transforms=[ dict(type='Resize', keep_ratio=False), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[1, 1, 1], to_rgb=True), dict(type='ImageToTensor', keys=['img']), dict(type='Collect', keys=['img']) ]) ]))
evaluation = dict(interval=1, metric='mAP')
optimizer = dict(type='SGD', lr=2.5e-05, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict( policy='step', warmup=None, warmup_iters=500, warmup_ratio=0.001, step=[16, 22])
runner = dict(type='EpochBasedRunner', max_epochs=24)
checkpoint_config = dict(interval=1)
log_config = dict( interval=50, hooks=[dict(type='TextLoggerHook'), dict(type='TensorboardLoggerHook')])
custom_hooks = [ dict(type='NumClassCheckHook'), dict(type='CheckInvalidLossHook', interval=50, priority='VERY_LOW') ]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/content/gdrive/MyDrive/OpenMMLab/mmdetection/checkpoints/ssd300_coco_20210803_015428-d231a06e.pth'
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=64)
work_dir = '/content/gdrive/MyDrive/OpenMMLab/mmdetection/checkpoints/fine_tuned'
auto_resume = False
gpu_ids = [0]

2022-11-21 18:01:32,383 - mmdet - INFO - Set random seed to 1129739117, deterministic: False
2022-11-21 18:01:32,607 - mmdet - INFO - initialize SSDVGG with init_cfg {'type': 'Pretrained', 'checkpoint': 'open-mmlab://vgg16_caffe'}
2022-11-21 18:01:32,608 - mmcv - INFO - load model from: open-mmlab://vgg16_caffe
2022-11-21 18:01:32,608 - mmcv - INFO - load checkpoint from openmmlab path: open-mmlab://vgg16_caffe
2022-11-21 18:01:32,696 - mmdet - INFO - initialize SSDNeck with init_cfg [{'type': 'Xavier', 'distribution': 'uniform', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}, {'type': 'Constant', 'val': 20, 'override': {'name': 'l2_norm'}}]
2022-11-21 18:01:32,718 - mmdet - INFO - initialize SSDHead with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform', 'bias': 0}
/content/gdrive/MyDrive/OpenMMLab/mmdetection/mmdet/datasets/custom.py:182: UserWarning: CustomDataset does not support filtering empty gt images.
  'CustomDataset does not support filtering empty gt images.')
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
2022-11-21 18:01:43,937 - mmdet - INFO - Automatic scaling of learning rate (LR) has been disabled.
2022-11-21 18:01:45,106 - mmdet - INFO - load checkpoint from local path: /content/gdrive/MyDrive/OpenMMLab/mmdetection/checkpoints/ssd300_coco_20210803_015428-d231a06e.pth
2022-11-21 18:01:45,299 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for bbox_head.cls_convs.0.0.weight: copying a param with shape torch.Size([324, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([20, 512, 3, 3]).
size mismatch for bbox_head.cls_convs.0.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([20]).
size mismatch for bbox_head.cls_convs.1.0.weight: copying a param with shape torch.Size([486, 1024, 3, 3]) from checkpoint, the shape in current model is torch.Size([30, 1024, 3, 3]).
size mismatch for bbox_head.cls_convs.1.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([30]).
size mismatch for bbox_head.cls_convs.2.0.weight: copying a param with shape torch.Size([486, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([30, 512, 3, 3]).
size mismatch for bbox_head.cls_convs.2.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([30]).
size mismatch for bbox_head.cls_convs.3.0.weight: copying a param with shape torch.Size([486, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([30, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.3.0.bias: copying a param with shape torch.Size([486]) from checkpoint, the shape in current model is torch.Size([30]).
size mismatch for bbox_head.cls_convs.4.0.weight: copying a param with shape torch.Size([324, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([20, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.4.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([20]).
size mismatch for bbox_head.cls_convs.5.0.weight: copying a param with shape torch.Size([324, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([20, 256, 3, 3]).
size mismatch for bbox_head.cls_convs.5.0.bias: copying a param with shape torch.Size([324]) from checkpoint, the shape in current model is torch.Size([20]).
2022-11-21 18:01:45,304 - mmdet - INFO - Start running, host: root@5d139416e600, work_dir: /content/gdrive/MyDrive/OpenMMLab/mmdetection/checkpoints/fine_tuned
2022-11-21 18:01:45,305 - mmdet - INFO - Hooks will be executed in the following order:
before_run: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook


before_train_epoch: (VERY_HIGH ) StepLrUpdaterHook
(NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook


before_train_iter: (VERY_HIGH ) StepLrUpdaterHook
(LOW ) IterTimerHook
(LOW ) EvalHook


after_train_iter: (ABOVE_NORMAL) OptimizerHook
(NORMAL ) CheckpointHook
(LOW ) IterTimerHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook
(VERY_LOW ) CheckInvalidLossHook


after_train_epoch: (NORMAL ) CheckpointHook
(LOW ) EvalHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook


before_val_epoch: (NORMAL ) NumClassCheckHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook


before_val_iter: (LOW ) IterTimerHook


after_val_iter: (LOW ) IterTimerHook


after_val_epoch: (VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook


after_run: (VERY_LOW ) TextLoggerHook
(VERY_LOW ) TensorboardLoggerHook


2022-11-21 18:01:45,305 - mmdet - INFO - workflow: [('train', 1)], max: 24 epochs
2022-11-21 18:01:45,306 - mmdet - INFO - Checkpoints will be saved to /content/gdrive/MyDrive/OpenMMLab/mmdetection/checkpoints/fine_tuned by HardDiskBackend.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
2022-11-21 18:02:12,953 - mmdet - INFO - Epoch [1][50/803] lr: 2.500e-05, eta: 2:40:55, time: 0.502, data_time: 0.083, memory: 5471, loss_cls: nan, loss_bbox: nan, loss: nan
INFO:mmdet:Epoch [1][50/803] lr: 2.500e-05, eta: 2:40:55, time: 0.502, data_time: 0.083, memory: 5471, loss_cls: nan, loss_bbox: nan, loss: nan
2022-11-21 18:02:12,962 - mmdet - INFO - loss become infinite or NaN!
INFO:mmdet:loss become infinite or NaN!
Traceback (most recent call last):
  File "tools/train.py", line 244, in <module>
    main()
  File "tools/train.py", line 240, in main
    meta=meta)
  File "/content/gdrive/MyDrive/OpenMMLab/mmdetection/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_iter')
  File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/content/gdrive/MyDrive/OpenMMLab/mmdetection/mmdet/core/hook/checkloss_hook.py", line 24, in after_train_iter
    runner.logger.info('loss become infinite or NaN!')
AssertionError: None

BIGWangYuDong commented 1 year ago

The FAQ has some solutions: https://github.com/open-mmlab/mmdetection/blob/master/docs/en/faq.md#training
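For this particular MMDetection 2.x config, those suggestions translate roughly into the snippets below. This is only a sketch: the max_norm, warmup_iters, and config-path values are illustrative assumptions, not settings verified on this dataset.

# Sketch only: enable gradient clipping and re-enable LR warmup
# (the posted config has warmup=None); the values are assumptions to tune.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',      # ramp the LR up from warmup_ratio * lr
    warmup_iters=1000,    # number of warmup iterations
    warmup_ratio=0.001,
    step=[16, 22])

It can also help to scan the custom annotations for degenerate or out-of-image boxes, since invalid ground truth is a common cause of NaN losses. The check below assumes MyDataset follows the CustomDataset middle format (data_infos entries with 'filename'/'width'/'height' and get_ann_info() returning 'bboxes'); 'my_ssd300_config.py' is a placeholder path.

# Sketch only: flag ground-truth boxes that are empty or fall outside the image.
from mmcv import Config
from mmdet.datasets import build_dataset

cfg = Config.fromfile('my_ssd300_config.py')        # placeholder config path
dataset = build_dataset(cfg.data.train['dataset'])  # unwrap RepeatDataset
for i in range(len(dataset)):
    info = dataset.data_infos[i]
    for x1, y1, x2, y2 in dataset.get_ann_info(i)['bboxes']:
        if x2 <= x1 or y2 <= y1 or x2 > info['width'] or y2 > info['height']:
            print(f"suspicious box in {info['filename']}: {(x1, y1, x2, y2)}")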

Vishalkagade commented 6 months ago

param_scheduler = [
    # Linear learning rate warm-up scheduler
    dict(
        type='LinearLR',  # Use a linear policy to warm up the learning rate
        start_factor=0.001,  # The ratio of the starting learning rate used for warmup
        by_epoch=False,  # The warmup learning rate is updated by iteration
        begin=0,  # Start from the first iteration
        end=1000),  # End the warmup at the 1000th iteration
    # The main LR scheduler
    dict(
        type='MultiStepLR',  # Use a multi-step learning rate policy during training
        by_epoch=True,  # The learning rate is updated by epoch
        begin=0,  # Start from the first epoch
        end=12,  # End at the 12th epoch
        milestones=[8, 11],  # Epochs at which to decay the learning rate
        gamma=0.1)  # The learning rate decay ratio
]

In the learning rate scheduler, go to the linear learning rate warm-up scheduler and change the warm-up iterations to 1200, i.e.

param_scheduler = [
    # Linear learning rate warm-up scheduler
    dict(
        type='LinearLR',  # Use a linear policy to warm up the learning rate
        start_factor=0.001,  # The ratio of the starting learning rate used for warmup
        by_epoch=False,  # The warmup learning rate is updated by iteration
        begin=0,  # Start from the first iteration
        end=1200),  # End the warmup at the 1200th iteration