It seems that you are loading from a checkpoint trained on COCO. I don't think you should change the mean and std. They are used by the pretrained model and should stay the same during fine-tuning (they are the mean and std of ImageNet). You should try something smaller than 0.02/8.
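For context, here is a minimal arithmetic sketch of that suggestion; it assumes the default lr=0.02 in faster_rcnn_r50_fpn_1x_coco.py was tuned for 8 GPUs x 2 images per GPU, which is the usual setup for the 1x COCO schedule:

```python
# Sketch only: turning "smaller than 0.02/8" into a concrete number.
base_lr = 0.02                    # default lr in faster_rcnn_r50_fpn_1x_coco.py (8 GPUs x 2 imgs/GPU)
suggested_ceiling = base_lr / 8   # 0.0025 -- try something below this when fine-tuning
print(suggested_ceiling)
```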
@xvjiarui Hi, thank you for your reply! The reason I modified the mean and std is that the image colors of my dataset are very different from ImageNet's (my dataset images are all grayscale). So you are suggesting that I keep the default mean and std and use an lr smaller than 0.02/8 (since I load from a checkpoint trained on COCO), right?
Yep, that's correct, since the backbone is frozen during training.
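For reference, the stock COCO configs normalize with the ImageNet statistics in the 0-255 pixel range. Below is a minimal sketch of keeping them unchanged; these exact values come from the default configs, not from this thread:

```python
# Default ImageNet normalization used by the stock COCO configs (0-255 pixel range).
# Keeping these during fine-tuning matches what the pretrained backbone expects,
# even when grayscale images are replicated to three channels.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    to_rgb=True)
```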
@xvjiarui Thank you! I will try it.
Hi, dear authors,
I have two 2080Ti GPUs. My dataset is in COCO format (converted from labelme), but the data is my own.
I used `config/faster_rcnn_r50_fpn_1x_coco.py`, with a few modifications (see my full config below).
Strange phenomenon:
I checked the dataset and am quite sure that no annotation with bbox = 0 exists in it. My machine environment is also completely normal, and so is the configuration.
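For reference, a quick sanity check of that kind with pycocotools (a sketch; the annotation path is just the one from my config):

```python
from pycocotools.coco import COCO

# annotation file from my config (adjust as needed)
coco = COCO('data/coco/HC_T2_NoCrop/annotations/instances_train2017.json')

# COCO boxes are [x, y, w, h]; flag any annotation with non-positive width or height
bad = [ann_id for ann_id, ann in coco.anns.items()
       if ann['bbox'][2] <= 0 or ann['bbox'][3] <= 0]
print(f'{len(bad)} degenerate boxes out of {len(coco.anns)} annotations')
```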
So is this a bug, or a problem with my own settings? Should the lr for two GPUs really be only 0.0001? Any help is greatly appreciated!
My config:

```python
model = dict(
    type='FasterRCNN',
    pretrained='torchvision://resnet50',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=1,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_across_levels=False,
            nms_pre=2000,
            nms_post=1000,
            max_num=1000,
            nms_thr=0.7,
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_across_levels=False,
            nms_pre=1000,
            nms_post=1000,
            max_num=1000,
            nms_thr=0.7,
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100)))
dataset_type = 'CocoDataset'
data_root = 'data/coco/HC_T2_NoCrop/'
img_norm_cfg = dict(
    mean=[0.102, 0.102, 0.102], std=[0.112, 0.112, 0.112], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(512, 512), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[0.102, 0.102, 0.102],
        std=[0.112, 0.112, 0.112],
        to_rgb=True),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(512, 512),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[0.102, 0.102, 0.102],
                std=[0.112, 0.112, 0.112],
                to_rgb=True),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='CocoDataset',
        ann_file='data/coco/HC_T2_NoCrop/annotations/instances_train2017.json',
        img_prefix='data/coco/HC_T2_NoCrop/train2017',
        pipeline=train_pipeline),
    val=dict(
        type='CocoDataset',
        ann_file='data/coco/HC_T2_NoCrop/annotations/instances_val2017.json',
        img_prefix='data/coco/HC_T2_NoCrop/val2017',
        pipeline=test_pipeline),
    test=dict(
        type='CocoDataset',
        ann_file='data/coco/HC_T2_NoCrop/annotations/instances_val2017.json',
        img_prefix='data/coco/HC_T2_NoCrop/val2017',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.0001, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
total_epochs = 50
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = 'mmdet-checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
resume_from = None
workflow = [('train', 1)]
work_dir = 'my_work_mmdet'
gpu_ids = range(0, 2)
```