open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

AMP in TOOD #10203

Open · jigongbao opened this issue 1 year ago

jigongbao commented 1 year ago

Describe the bug

When I train TOOD with AMP I get an error; when I remove AMP, training works fine. [screenshot of the error traceback]

Reproduction

  1. What command or script did you run?

```shell
CUDA_VISIBLE_DEVICES=2,3,4,5 bash ./tools/dist_train.sh configs/tood/tood_r50_fpn_1x_vis.py 4
```

  2. Did you make any modifications on the code or config? Did you understand what you have modified?

Yes. I modified the FPN to output only P2-P5, and replaced the SGD optimizer with AdamW. Here is my config (see the note after this list for what `--amp` changes in `optim_wrapper`):

```python
_base_ = [
    '../_base_/datasets/visdrone_detection.py',
    '../_base_/schedules/schedule_vis.py',
    '../_base_/default_runtime.py'
]

# model settings
model = dict(
    type='TOOD',
    data_preprocessor=dict(
        type='DetDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True,
        pad_size_divisor=32),
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=0,
        add_extra_convs='on_output',
        num_outs=4),
    bbox_head=dict(
        type='TOODHead',
        num_classes=10,
        in_channels=256,
        stacked_convs=6,
        feat_channels=256,
        anchor_type='anchor_free',
        anchor_generator=dict(
            type='AnchorGenerator',
            ratios=[1.0],
            octave_base_scale=8,
            scales_per_octave=1,
            strides=[4, 8, 16, 32]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[.0, .0, .0, .0],
            target_stds=[0.1, 0.1, 0.2, 0.2]),
        initial_loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            activated=True,  # use probability instead of logit as input
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_cls=dict(
            type='QualityFocalLoss',
            use_sigmoid=True,
            activated=True,  # use probability instead of logit as input
            beta=2.0,
            loss_weight=1.0),
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),
    train_cfg=dict(
        initial_epoch=4,
        initial_assigner=dict(type='ATSSAssigner', topk=9),
        assigner=dict(type='TaskAlignedAssigner', topk=13),
        alpha=1,
        beta=6,
        allowed_border=-1,
        pos_weight=-1,
        debug=False),
    test_cfg=dict(
        nms_pre=1500,
        min_bbox_size=0,
        score_thr=0.05,
        nms=dict(type='nms', iou_threshold=0.6),
        max_per_img=500))

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }),
    optimizer=dict(
        _delete_=True,
        type='AdamW',
        lr=0.005,
        betas=(0.9, 0.999),
        weight_decay=0.05))
```

  3. What dataset did you use?

VisDrone.

  4. Please run `python mmdet/utils/collect_env.py` to collect necessary environment information and paste it here.

```
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: None
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1
OpenCV: 4.7.0
MMEngine: 0.7.2
MMDetection: 3.0.0+unknown
```
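A note on how AMP is switched on here: in MMDetection 3.x, passing `--amp` to `tools/train.py` rewrites `optim_wrapper` at runtime to use MMEngine's `AmpOptimWrapper` with dynamic loss scaling. A minimal sketch of the config-side equivalent (this only enables mixed precision; it is not a fix for the crash reported below):

```python
# Sketch of what --amp does to the optim_wrapper above in MMDetection 3.x:
# tools/train.py swaps the wrapper type and enables dynamic loss scaling.
optim_wrapper = dict(
    type='AmpOptimWrapper',   # was 'OptimWrapper'
    loss_scale='dynamic',     # dynamic gradient scaling, set by --amp
    paramwise_cfg=dict(
        custom_keys={
            'absolute_pos_embed': dict(decay_mult=0.),
            'relative_position_bias_table': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.)
        }),
    optimizer=dict(
        type='AdamW', lr=0.005, betas=(0.9, 0.999), weight_decay=0.05))
```

Setting the wrapper type in the config like this is interchangeable with passing `--amp` on the command line.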

lyzhongcrd commented 1 year ago

I can confirm the issue still exists with MMDetection 3.0 + PyTorch 2.0. The minimal command with the default config to reproduce the error:

```shell
python tools/train.py configs/tood/tood_r50_fpn_1x_coco.py --auto-scale-lr --amp
```

The error I got:

```
RuntimeError: torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.
Many models use a sigmoid layer right before the binary cross entropy layer.
In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits
or torch.nn.BCEWithLogitsLoss. binary_cross_entropy_with_logits and BCEWithLogits are safe to autocast.
```

Is AMP/FP16 still unsupported for TOOD, as mentioned in #7113?
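The traceback points at PyTorch's autocast guard on `F.binary_cross_entropy`. TOOD's classification losses are configured with `activated=True` (see the config above), meaning the head feeds already-sigmoided probabilities into the loss, which, judging from the message, ends up calling `F.binary_cross_entropy`; autocast rejects that call, while the fused `*_with_logits` variant is allowed. A minimal sketch of the difference (illustrative tensors, not mmdet code; requires a CUDA device):

```python
# Minimal reproduction of the autocast guard that produces the error above.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, device='cuda')
targets = torch.rand(4, 10, device='cuda')

with torch.autocast(device_type='cuda', dtype=torch.float16):
    # Safe: sigmoid and BCE are fused; computed in fp32 internally.
    loss_ok = F.binary_cross_entropy_with_logits(logits, targets)

    # Unsafe: mirrors the activated=True path, where post-sigmoid
    # probabilities are passed to binary_cross_entropy.
    try:
        F.binary_cross_entropy(logits.sigmoid(), targets)
    except RuntimeError as e:
        print(e)  # "...are unsafe to autocast..."
```

Until TOOD's loss path is made autocast-safe, the practical options seem to be training without `--amp`, or forcing that BCE call out of autocast (e.g. wrapping it in `torch.autocast(device_type='cuda', enabled=False)` with inputs cast to float32); whether TOOD trains correctly with `activated=False` has not been verified here.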