I am trying to fine-tune from a COCO pre-trained model, but the loss is always nan.

muzishen commented 3 years ago

When I replace the dataset config, the training loss is always "nan". Can you help me? Thanks! I only changed the paths to the new dataset; all other settings are the same as in your example.

model = dict(
    type='GFL',
    backbone=dict(
        type='Res2Net',
        depth=50,
        scales=4,
        base_width=26,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, False, False, True)),
    neck=[
        dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            start_level=1,
            add_extra_convs='on_output',
            num_outs=5),
        dict(
            type='SEPC',
            out_channels=256,
            stacked_convs=4,
            pconv_deform=False,
            lcconv_deform=True,
            ibn=True,
            pnorm_eval=True,
            lcnorm_eval=True,
            lcconv_padding=1)
    ],
    bbox_head=dict(
        type='GFLSEPCHead',
        num_classes=6,
        in_channels=256,
        stacked_convs=0,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[8, 16, 32, 64, 128]),
        loss_cls=dict(
            type='QualityFocalLoss',
            use_sigmoid=True,
            beta=2.0,
            loss_weight=1.0),
        loss_dfl=dict(type='DistributionFocalLoss', loss_weight=0.25),
        reg_max=16,
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0)))
train_cfg = dict(
    assigner=dict(type='ATSSAssigner', topk=9),
    allowed_border=-1,
    pos_weight=-1,
    debug=False)
test_cfg = dict(
    nms_pre=1000,
    min_bbox_size=0,
    score_thr=0.05,
    nms=dict(type='nms', iou_threshold=0.6),
    max_per_img=100)
dataset_type = 'Tile'
data_root = '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='Resize',
        img_scale=[(1333, 480), (1333, 960)],
        multiscale_mode='range',
        keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=2,
    train=dict(
        type='Tile',
        ann_file=
        '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/annotations/instances_train2017.json',
        img_prefix=
        '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/train2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(
                type='Resize',
                img_scale=[(1333, 480), (1333, 960)],
                multiscale_mode='range',
                keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ]),
    val=dict(
        type='Tile',
        ann_file=
        '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/annotations/instances_val2017.json',
        img_prefix=
        '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='Tile',
        ann_file=
        '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/annotations/instances_val2017.json',
        img_prefix=
        '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/datasets/tianchi_tile/tc_dataset/val2017/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1333, 800),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.05, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=0.001,
    step=[8, 11])
total_epochs = 12
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/coco_model/res2net50_v1b_26w_4s-3cf99910_mmdetv2.pth'
resume_from = None
workflow = [('train', 1)]
fp16 = dict(loss_scale=512.0)
work_dir = './work_dirs/universenet50_2008_fp16_4x2_mstrain_480_960_1x_smallbatch_finetuning_example'
gpu_ids = range(0, 1)

shinya7y commented 3 years ago

res2net50_v1b_26w_4s-3cf99910_mmdetv2.pth is just an ImageNet pre-trained model for the Res2Net backbone.

Please try load_from = 'https://github.com/shinya7y/UniverseNet/releases/download/20.08/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth' # noqa
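One way to tell a backbone-only checkpoint from a full detector checkpoint is to look at the top-level prefixes of its parameter names. The sketch below is only an illustration: the helper names and the sample key lists are hypothetical (only `bbox_head.gfl_cls.weight` appears in this thread), and it assumes a full detector checkpoint stores its weights under `backbone.`, `neck.`, and `bbox_head.` prefixes.

```python
def summarize_checkpoint_keys(keys):
    """Count parameter names by their top-level module prefix."""
    counts = {}
    for key in keys:
        prefix = key.split('.', 1)[0]
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts


def looks_like_full_detector(keys):
    """Heuristic (assumption): a detector has backbone, neck and bbox_head weights."""
    prefixes = set(summarize_checkpoint_keys(keys))
    return {'backbone', 'neck', 'bbox_head'} <= prefixes


# Hypothetical key lists, for illustration only.
backbone_only_keys = ['conv1.0.weight', 'layer1.0.conv1.weight', 'fc.weight']
detector_keys = [
    'backbone.conv1.0.weight',
    'neck.0.lateral_convs.0.conv.weight',
    'bbox_head.gfl_cls.weight',
]

print(looks_like_full_detector(backbone_only_keys))
print(looks_like_full_detector(detector_keys))
```

With a real file, the keys could be obtained with something like `ckpt = torch.load(path, map_location='cpu')` followed by `ckpt.get('state_dict', ckpt).keys()`, assuming the checkpoint is a plain dictionary.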

muzishen commented 3 years ago

Thank you. I have changed the pre-trained model, but the loss is still nan!

Log:
Downloading: "https://github.com/shinya7y/UniverseNet/releases/download/20.08/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth" to /root/.cache/torch/checkpoints/universenet50_2008_fp16_4x4_mstrain_480_960_2x_coco_20200815_epoch_24-81356447.pth
100%|██████████████████████████████████████████████████████████| 70.1M/70.1M [04:00<00:00, 306kB/s]
2021-01-20 14:43:59,975 - mmdet - WARNING - The model and loaded state dict do not match exactly

size mismatch for bbox_head.gfl_cls.weight: copying a param with shape torch.Size([80, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([6, 256, 3, 3]).
size mismatch for bbox_head.gfl_cls.bias: copying a param with shape torch.Size([80]) from checkpoint, the shape in current model is torch.Size([6]).
2021-01-20 14:43:59,976 - mmdet - INFO - Start running, host: root@shenfei-SYS-7048GR-TR, work_dir: /media/xieyi/b8f66a88-09dd-4190-b817-493aef1819d5/xieyi/Detetection/UniverseNet-master/work_dirs/universenet50_2008_fp16_4x2_mstrain_480_960_1x_smallbatch_finetuning_example
2021-01-20 14:43:59,976 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2021-01-20 14:45:03,859 - mmdet - INFO - Epoch [1][50/80] lr: 1.098e-04, eta: 0:19:21, time: 1.277, data_time: 0.070, memory: 7349, loss_cls: nan, loss_bbox: nan, loss_dfl: nan, loss: nan, grad_norm: nan
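The two size mismatches in this log are what one would expect when num_classes changes from 80 (COCO) to 6: only the classification layer of the head differs in shape. As a rough sketch, the comparison behind such a warning can be reproduced on plain shape tuples. The function name is hypothetical, the first two entries are copied from the log above, and the `gfl_reg` entry is an illustrative assumption based on reg_max=16.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(model_shape) != tuple(ckpt_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


checkpoint_shapes = {
    'bbox_head.gfl_cls.weight': (80, 256, 3, 3),  # from the log
    'bbox_head.gfl_cls.bias': (80,),              # from the log
    'bbox_head.gfl_reg.weight': (68, 256, 3, 3),  # assumed: 4 * (reg_max + 1) outputs
}
model_shapes = {
    'bbox_head.gfl_cls.weight': (6, 256, 3, 3),
    'bbox_head.gfl_cls.bias': (6,),
    'bbox_head.gfl_reg.weight': (68, 256, 3, 3),
}

for name, (ckpt, model) in find_shape_mismatches(checkpoint_shapes, model_shapes).items():
    print(name, ckpt, '->', model)
```

If only the `gfl_cls` parameters show up, the rest of the checkpoint matched the model, so the mismatch warning by itself probably does not explain the nan loss.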

shinya7y commented 3 years ago

What happens if the learning rate is very low?
optimizer = dict(type='SGD', lr=0.0001, momentum=0.9, weight_decay=0.0001)

For debugging, log every iteration:
log_config = dict(interval=1, hooks=[dict(type='TextLoggerHook')])
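With per-iteration logging, it helps to find the first iteration at which any value becomes nan. The following is a minimal sketch that parses text log lines in the format shown earlier in this thread; the function names are hypothetical, and the first sample line is made up for illustration (the second is copied from the log above).

```python
import math
import re

# Matches the "Epoch [e][i/n] key: value, ..." part of a text log line.
LOG_PATTERN = re.compile(r'Epoch \[(\d+)\]\[(\d+)/(\d+)\]\s+(.*)')


def parse_log_line(line):
    """Return (epoch, iteration, {name: float}) or None if the line is not a training line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, iteration, _total, rest = match.groups()
    values = {}
    for field in rest.split(','):
        name, sep, value = field.partition(':')
        if not sep:
            continue
        try:
            values[name.strip()] = float(value)
        except ValueError:
            pass  # non-numeric fields such as "eta: 0:19:21" are skipped
    return int(epoch), int(iteration), values


def first_nan_iteration(lines):
    """Return (epoch, iteration, sorted nan field names) for the first nan, else None."""
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        epoch, iteration, values = parsed
        nan_fields = sorted(k for k, v in values.items() if math.isnan(v))
        if nan_fields:
            return epoch, iteration, nan_fields
    return None


sample_lines = [
    # Hypothetical healthy line, for illustration only.
    '2021-01-20 14:44:01,000 - mmdet - INFO - Epoch [1][1/80] lr: 5.000e-05, loss_cls: 1.2, loss: 3.4',
    # Line copied from the log in this thread.
    '2021-01-20 14:45:03,859 - mmdet - INFO - Epoch [1][50/80] lr: 1.098e-04, eta: 0:19:21, '
    'time: 1.277, data_time: 0.070, memory: 7349, loss_cls: nan, loss_bbox: nan, '
    'loss_dfl: nan, loss: nan, grad_norm: nan',
]
print(first_nan_iteration(sample_lines))
```

If the very first logged iteration is already nan, the problem is more likely in the data or the loaded weights than in the learning rate schedule.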

muzishen commented 3 years ago

Even with a very low learning rate (e.g., 0.0001 or 0.00001), the loss is still "nan".

shinya7y commented 3 years ago

Do simpler networks (e.g., RetinaNet, ATSS, GFL) work? Do popular datasets (e.g., COCO) work? Besides, I recommend using mmcv-full 1.1.2 to avoid version issues.

zhengye1995 commented 3 years ago

I have the same issue. Did you fix it?

muzishen commented 3 years ago

I haven't fixed it yet.

shinya7y commented 3 years ago

I am closing this inactive issue because it lacks enough information to reproduce the nan. If it is caused by empty gt, please use the latest code. I have fixed ATSSHead and GFLHead in this repository and in the mmdet repository in the same way.
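To check whether a custom dataset contains images with empty gt, a COCO-format annotation file can be scanned before training. This is only a sketch under the assumption that the annotations follow the standard COCO JSON layout (`images`, `annotations`, `categories`, with `bbox` as `[x, y, w, h]`); the function names and the toy data are hypothetical.

```python
import json


def check_coco_annotations(coco):
    """Report images without annotations and boxes with non-positive width or height."""
    image_ids = {img['id'] for img in coco.get('images', [])}
    annotated = set()
    degenerate = []
    for ann in coco.get('annotations', []):
        annotated.add(ann['image_id'])
        _x, _y, w, h = ann['bbox']
        if w <= 0 or h <= 0:
            degenerate.append(ann['id'])
    return {
        'num_categories': len(coco.get('categories', [])),
        'images_without_gt': sorted(image_ids - annotated),
        'degenerate_box_ids': degenerate,
    }


def check_coco_file(path):
    """Load a COCO-format JSON file and run the checks above."""
    with open(path) as f:
        return check_coco_annotations(json.load(f))


# Toy data, for illustration only.
toy = {
    'images': [{'id': 1}, {'id': 2}, {'id': 3}],
    'annotations': [
        {'id': 10, 'image_id': 1, 'category_id': 1, 'bbox': [5, 5, 20, 30]},
        {'id': 11, 'image_id': 2, 'category_id': 2, 'bbox': [0, 0, 0, 15]},
    ],
    'categories': [{'id': 1}, {'id': 2}],
}
print(check_coco_annotations(toy))
```

The reported `num_categories` can also be compared with `num_classes` in the config (6 in this thread) as an extra sanity check.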

shinya7y / UniverseNet

I am trying to fine-tune from a COCO pre-trained model, but the loss is always nan. #12