loss nan error - Githubissues

whut2962575697 commented 3 years ago

Thank you for your great work! But the loss is always nan when I train on my own dataset. Can you help me?

This is my config:

# This config shows an example for small-batch fine-tuning from a COCO model.
# Please see also the MMDetection tutorial below.
# https://github.com/shinya7y/UniverseNet/blob/master/docs/tutorials/finetune.md

_base_ = [
    '../_base_/models/universenet50_2008.py',
    # Please change to your dataset config.
    # '../_base_/datasets/coco_detection_mstrain_480_960.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]

model = dict(
    pretrained=None,
    # SyncBN is used in universenet50_2008.py
    # If total batch size < 16, please change BN settings of backbone.
    backbone=dict(
        norm_cfg=dict(type='BN', requires_grad=True), norm_eval=True),
    # iBN of SEPC is used in universenet50_2008.py
    # If samples_per_gpu < 4, please change BN settings of SEPC.
    neck=[
        dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            start_level=1,
            add_extra_convs='on_output',
            # add_extra_convs=True,
            # extra_convs_on_inputs=False,
            num_outs=5),
        dict(
            type='SEPC',
            out_channels=256,
            stacked_convs=4,
            pconv_deform=False,
            lcconv_deform=True,
            ibn=True,
            pnorm_eval=True,  # please set True if samples_per_gpu < 4
            lcnorm_eval=True,  # please set True if samples_per_gpu < 4
            lcconv_padding=1)
    ],
    bbox_head=dict(num_classes=6))  # please change for your dataset

dataset_type = 'MyDataset'

data_root = '/cache/my_dataset/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),

    dict(type='Resize', img_scale=[(4000, 2000), (4000, 2400)],
         multiscale_mode='range', keep_ratio=True),

    dict(type='RandomFlip', flip_ratio=0.5),

    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),

    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',

        # img_scale=[(4000, 2000), (4000, 2200), (4000, 2400)],
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),

            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

data = dict(
    imgs_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type=dataset_type,

        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',

        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='bbox')

# Optimal total batch size depends on dataset size and learning rate.
# If image sizes are not so large and you have enough GPU memory,
# larger samples_per_gpu will be preferable.
# data = dict(samples_per_gpu=2)

# This config assumes that total batch size is 8 (4 GPUs * 2 samples_per_gpu).
# Since the batch size is half of other configs,
# the learning rate is also halved according to the Linear Scaling Rule.
# Tuning learning rate around it will be important on other datasets.
# For example, you can try 0.005 first, then 0.002, 0.01, 0.001, and 0.02.
optimizer = dict(type='SGD', lr=1.25e-3, momentum=0.9, weight_decay=0.0001)
# optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))

# If fine-tuning from COCO, gradients should not be so large.
# It is natural to train models without gradient clipping.
optimizer_config = dict(_delete_=True, grad_clip=None)

# If fine-tuning from COCO, a warmup_iters of 500 or less may be enough.
# This setting is not so important unless losses are unstable during warmup.
lr_config = dict(warmup_iters=500)

fp16 = dict(loss_scale=512.)

# Please set `load_from` to use a COCO pre-trained model.
load_from = '/cache/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth'  # noqa

shinya7y commented 3 years ago

Could you please show your training log? Example: https://github.com/shinya7y/UniverseNet/issues/5#issuecomment-674520670

whut2962575697 commented 3 years ago

Thank you!

Python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: Tesla V100-PCIE-32GB
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.1.5
MMDetection: 2.4.0+unknown
MMDetection Compiler: GCC 5.4
MMDetection CUDA Compiler: 10.1
------------------------------------------------------------

2021-01-20 14:21:33,609 - mmdet - INFO - Distributed training: False
2021-01-20 14:21:33,960 - mmdet - INFO - Config:
model = dict(
    type='GFL',
    pretrained=None,
    backbone=dict(
        type='Res2Net',
        depth=50,
        scales=4,
        base_width=26,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, False, False, True)),
    neck=[
        dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            start_level=1,
            add_extra_convs='on_output',
            num_outs=5),
        dict(
            type='SEPC',
            out_channels=256,
            stacked_convs=4,
            pconv_deform=False,
            lcconv_deform=True,
            ibn=True,
            pnorm_eval=True,
            lcnorm_eval=True,
            lcconv_padding=1)
    ],
    bbox_head=dict(
        type='GFLSEPCHead',
        num_classes=6,
        in_channels=256,
        stacked_convs=0,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128]),
        loss_cls=dict(
            type='QualityFocalLoss',
            use_sigmoid=True,
            beta=2.0,
            loss_weight=1.0),
        loss_dfl=dict(type='DistributionFocalLoss', loss_weight=0.25),
        reg_max=16,
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0),
        reg_decoded_bbox=True))
train_cfg = dict(
    assigner=dict(type='ATSSAssigner', topk=9),
    allowed_border=-1,
    pos_weight=-1,
    debug=False)
test_cfg = dict(
    nms_pre=1000,
    min_bbox_size=0,
    score_thr=0.05,
    nms=dict(type='nms', iou_threshold=0.6),
    max_per_img=100)
optimizer = dict(type='SGD', lr=0.000125, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=2000,
    warmup_ratio=0.001,
    step=[8, 11])
total_epochs = 12
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/cache/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth'
resume_from = None
workflow = [('train', 1)]
dataset_type = 'TCDataset'
data_root = '/cache/tc_dataset/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
albu_train_transforms = []
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Resize',
        img_scale=[(6000, 3600), (6000, 4000)],
        multiscale_mode='range',
        keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=[(6000, 3600), (6000, 3800), (6000, 4000)],
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    imgs_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type='TCDataset',
        ann_file='/cache/tc_dataset/annotations/instances_train2017.json',
        img_prefix='/cache/tc_dataset/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                type='Resize',
                img_scale=[(6000, 3600), (6000, 4000)],
                multiscale_mode='range',
                keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='TCDataset',
        ann_file='/cache/tc_dataset/annotations/instances_val2017.json',
        img_prefix='/cache/tc_dataset/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=[(6000, 3600), (6000, 3800), (6000, 4000)],
                flip=True,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='TCDataset',
        ann_file='/cache/testA.json',
        img_prefix='/cache/testA_imgs/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=[(6000, 3600), (6000, 3800), (6000, 4000)],
                flip=True,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
fp16 = dict(loss_scale=512.0)
work_dir = './work_dirs/universenet50_2008_1x'
gpu_ids = range(0, 1)

loading annotations into memory...
Done (t=0.12s)
creating index...
index created!
2021-01-20 14:21:34,920 - mmdet - WARNING - "imgs_per_gpu" is deprecated in MMDet V2.0. Please use "samples_per_gpu" instead
2021-01-20 14:21:34,921 - mmdet - WARNING - Automatically set "samples_per_gpu"="imgs_per_gpu"=1 in this experiments
loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
2021-01-20 14:21:38,261 - mmdet - INFO - load checkpoint from /cache/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth
2021-01-20 14:21:38,388 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for bbox_head.gfl_cls.weight: copying a param with shape torch.Size([80, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([6, 256, 3, 3]).
size mismatch for bbox_head.gfl_cls.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([6]).
2021-01-20 14:21:38,390 - mmdet - INFO - Start running, host: work@job9391f5af-job-universenet2021-5303-0, work_dir: /cache/user-job-dir/codes/UniverseNet/work_dirs/universenet50_2008_1x
2021-01-20 14:21:38,390 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
[W TensorIterator.cpp:924] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())
2021-01-20 14:22:54,996 - mmdet - INFO - Epoch [1][50/4310] lr: 3.184e-06, eta: 21:58:05, time: 1.531, data_time: 0.207, memory: 19023, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:24:11,893 - mmdet - INFO - Epoch [1][100/4310]    lr: 6.306e-06, eta: 21:59:58, time: 1.538, data_time: 0.224, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:25:25,627 - mmdet - INFO - Epoch [1][150/4310]    lr: 9.428e-06, eta: 21:41:37, time: 1.475, data_time: 0.184, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:26:39,751 - mmdet - INFO - Epoch [1][200/4310]    lr: 1.255e-05, eta: 21:33:30, time: 1.482, data_time: 0.226, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:27:54,799 - mmdet - INFO - Epoch [1][250/4310]    lr: 1.567e-05, eta: 21:31:18, time: 1.501, data_time: 0.213, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
2021-01-20 14:29:06,125 - mmdet - INFO - Epoch [1][300/4310]    lr: 1.879e-05, eta: 21:18:48, time: 1.427, data_time: 0.182, memory: 19033, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan
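
As a sanity check on the log above, the logged lr values can be reproduced from the dumped config (lr=0.000125, warmup_iters=2000, warmup_ratio=0.001). This suggests the run used those values rather than lr=1.25e-3 and warmup_iters=500 from the posted config. The following is a minimal sketch, assuming the linear warmup rule used by MMCV's LR updater hook and assuming the logged value corresponds to the zero-based iteration index (logged iteration minus one). `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(base_lr, cur_iter, warmup_iters, warmup_ratio):
    """Linear warmup: lr rises from base_lr * warmup_ratio towards base_lr."""
    k = (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)
    return base_lr * (1 - k)


# Values taken from the config dumped in the training log.
BASE_LR, WARMUP_ITERS, WARMUP_RATIO = 0.000125, 2000, 0.001

for logged_iter in (50, 100, 150, 300):
    # Assumption: the hook sees a zero-based iteration counter.
    lr = linear_warmup_lr(BASE_LR, logged_iter - 1, WARMUP_ITERS, WARMUP_RATIO)
    print(logged_iter, f"{lr:.3e}")
```

Under these assumptions the values line up with the log (3.184e-06 at iteration 50, 6.306e-06 at iteration 100), so the learning rate is already very small when nan first appears. That makes an overly large learning rate a less likely cause.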

shinya7y commented 3 years ago

PyTorch compiling details: PyTorch built with:
  - CUDA Runtime 10.2
MMDetection CUDA Compiler: 10.1

Please use the same CUDA version for both, though this may be unrelated to the nan.

Do simpler networks (e.g., RetinaNet, ATSS, GFL) work? Do popular datasets (e.g., COCO) work?
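
The version mismatch quoted above (CUDA Runtime 10.2 for PyTorch vs. CUDA compiler 10.1 for MMDetection) can also be spotted mechanically from the environment dump. The following is a rough sketch that only parses the text of such a dump. `cuda_versions` is a hypothetical helper, and the regular expressions assume the exact wording shown in the log of this issue.

```python
import re


def cuda_versions(env_text):
    """Return (pytorch_runtime, mmdet_compiler) CUDA versions found in an env dump."""
    runtime = re.search(r"CUDA Runtime\s+(\d+\.\d+)", env_text)
    compiler = re.search(r"MMDetection CUDA Compiler:\s*(\d+\.\d+)", env_text)
    return (
        runtime.group(1) if runtime else None,
        compiler.group(1) if compiler else None,
    )


# Lines copied from the environment info in this issue.
env_text = """
PyTorch compiling details: PyTorch built with:
  - CUDA Runtime 10.2
MMDetection CUDA Compiler: 10.1
"""

runtime, compiler = cuda_versions(env_text)
if runtime != compiler:
    print(f"CUDA mismatch: PyTorch runtime {runtime} vs MMDetection compiler {compiler}")
```

In a live environment, `torch.version.cuda` reports the CUDA version PyTorch was built with, which can be compared against the version used to compile the detection ops.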

whut2962575697 commented 3 years ago

I trained on the dataset with Cascade R-CNN and got a good result.

 Average Precision  (AP) @[ IoU=0.10:0.50 | area=   all | maxDets=100 ] = 0.675
 Average Precision  (AP) @[ IoU=0.10      | area=   all | maxDets=100 ] = 0.708
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.617
 Average Precision  (AP) @[ IoU=0.10:0.50 | area= small | maxDets=100 ] = 0.603
 Average Precision  (AP) @[ IoU=0.10:0.50 | area=medium | maxDets=100 ] = 0.793
 Average Precision  (AP) @[ IoU=0.10:0.50 | area= large | maxDets=100 ] = 0.959
 Average Recall     (AR) @[ IoU=0.10:0.50 | area=   all | maxDets=  1 ] = 0.604
 Average Recall     (AR) @[ IoU=0.10:0.50 | area=   all | maxDets= 10 ] = 0.898
 Average Recall     (AR) @[ IoU=0.10:0.50 | area=   all | maxDets=100 ] = 0.947
 Average Recall     (AR) @[ IoU=0.10:0.50 | area= small | maxDets=100 ] = 0.925
 Average Recall     (AR) @[ IoU=0.10:0.50 | area=medium | maxDets=100 ] = 0.965
 Average Recall     (AR) @[ IoU=0.10:0.50 | area= large | maxDets=100 ] = 0.998

shinya7y commented 3 years ago

Does training on COCO with the original finetuning_example.py work? For this issue, 500 iterations should be enough to check whether nan appears.
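
A short run like this can be checked by scanning the text log for the first nan instead of reading it by eye. The following is a minimal sketch, assuming the log line format shown earlier in this issue. `first_nan_iteration` is a hypothetical helper, not part of MMDetection.

```python
import re

# Matches e.g. "Epoch [1][50/4310] ... loss_cls: nan, ..., loss: nan"
# and captures epoch, iteration, and the total loss value.
LOG_LINE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/\d+\].*?\bloss: (\S+)")


def first_nan_iteration(log_lines):
    """Return (epoch, iteration) of the first line whose total loss is nan, else None."""
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group(3).lower() == "nan":
            return int(match.group(1)), int(match.group(2))
    return None


# First training line from the log in this issue.
sample = [
    "2021-01-20 14:22:54,996 - mmdet - INFO - Epoch [1][50/4310] lr: 3.184e-06, "
    "eta: 21:58:05, time: 1.531, data_time: 0.207, memory: 19023, "
    "loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan",
]
print(first_nan_iteration(sample))
```

If the function returns None after the first 500 iterations on COCO, the problem is more likely tied to the custom dataset or its config than to the environment.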

zhengye1995 commented 3 years ago

I have the same issue. Did you fix it?

whut2962575697 commented 3 years ago

Sorry, it has not been fixed yet.

shinya7y commented 3 years ago

I am closing this inactive issue because it lacks enough information to reproduce the nan. If it is caused by empty gt (images without ground-truth boxes), please use the latest code. I have fixed ATSSHead and GFLHead in this repository and in the mmdet repository in the same way.
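
To see whether a dataset contains images with empty gt at all, the annotation file can be inspected directly. The following is a minimal sketch, assuming a COCO-style JSON with `images` and `annotations` lists. `load_coco` and `images_without_gt` are hypothetical helpers, and the tiny in-memory dict only stands in for a real annotation file.

```python
import json


def load_coco(path):
    """Load a COCO-style annotation file from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def images_without_gt(coco):
    """Return sorted ids of images that have no annotation entry."""
    annotated = {ann["image_id"] for ann in coco.get("annotations", [])}
    return sorted(img["id"] for img in coco["images"] if img["id"] not in annotated)


# Illustrative data: image 1 has one box, images 2 and 3 have none.
coco = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [0, 0, 5, 5], "category_id": 1},
    ],
}
print(images_without_gt(coco))
```

For a real dataset, pass the result of `load_coco` on the training annotation file (for example the `ann_file` from the config) to `images_without_gt`. A non-empty result means the empty-gt case mentioned above can occur during training.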

shinya7y / UniverseNet

loss nan error #13