open-mmlab / mmrotate

OpenMMLab Rotated Object Detection Toolbox and Benchmark
https://mmrotate.readthedocs.io/en/latest/
Apache License 2.0

[Bug] With MMrotate dev1.x branch, train rotated faster RCNN on DOTA loss=nan #988

Open harmlessSR opened 9 months ago

harmlessSR commented 9 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

1.x branch https://github.com/open-mmlab/mmrotate/tree/1.x

Environment

sys.platform: linux
Python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda-11.6
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

TorchVision: 0.13.1
OpenCV: 4.9.0
MMEngine: 0.10.3
MMRotate: 1.0.0rc1+

Reproduces the problem - code sample

_base_ = [
    '../_base_/datasets/dota_my.py',
    '../_base_/schedules/schedule_1x.py',
    '../_base_/default_runtime.py'
]

angle_version = 'le90'
model = dict(
    type='mmdet.FasterRCNN',
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True,
        pad_size_divisor=32,
        boxtype2tensor=False),
    backbone=dict(
        type='mmdet.ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='mmdet.FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='mmdet.RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='mmdet.AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64],
            use_box_type=True),
        bbox_coder=dict(
            type='DeltaXYWHHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0],
            use_box_type=True),
        loss_cls=dict(
            type='mmdet.CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(
            type='mmdet.SmoothL1Loss', beta=0.1111111111111111, loss_weight=1.0)),
    roi_head=dict(
        type='mmdet.StandardRoIHead',
        bbox_roi_extractor=dict(
            type='mmdet.SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='mmdet.Shared2FCBBoxHead',
            predict_box_type='rbox',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=15,
            reg_predictor_cfg=dict(type='mmdet.Linear'),
            cls_predictor_cfg=dict(type='mmdet.Linear'),
            bbox_coder=dict(
                type='DeltaXYWHTHBBoxCoder',
                angle_version=angle_version,
                norm_factor=2,
                edge_swap=True,
                target_means=(.0, .0, .0, .0, .0),
                target_stds=(0.1, 0.1, 0.2, 0.2, 0.1)),
            reg_class_agnostic=True,
            loss_cls=dict(
                type='mmdet.CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(
                type='mmdet.SmoothL1Loss', beta=1.0, loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='mmdet.MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1,
                iou_calculator=dict(type='RBbox2HBboxOverlaps2D')),
            sampler=dict(
                type='mmdet.RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=0,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=2000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='mmdet.MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                ignore_iof_thr=-1,
                iou_calculator=dict(type='RBbox2HBboxOverlaps2D')),
            sampler=dict(
                type='mmdet.RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=2000,
            max_per_img=2000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            nms_pre=2000,
            min_bbox_size=0,
            score_thr=0.05,
            nms=dict(type='nms_rotated', iou_threshold=0.1),
            max_per_img=2000)))

optim_wrapper = dict( type='OptimWrapper', optimizer=dict(type='SGD', lr=0.020, momentum=0.9, weight_decay=0.0001), clip_grad=dict(max_norm=35, norm_type=2))

# added config

train_dataloader = dict( batch_size=4, num_workers=4)

work_dir = 'work_dirs/r50_test/'

test_evaluator = dict( outfile_prefix='./work_dirs/r50_test')

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0,1 ./tools/dist_train.sh ./configs/rotated_faster_rcnn/rotated-faster-rcnn-le90_r50_fpn_1x_dota.py 2

Reproduces the problem - error message

(mmrotatedev1toch112) WuMingrui@Turing14:~/Workspace/mmrotate-dev-1.x$ CUDA_VISIBLE_DEVICES=0,1 ./tools/dist_train.sh ./configs/rotated_faster_rcnn/rotated-faster-rcnn-le90_r50_fpn_1x_dota.py 2
/home/WuMingrui/miniconda3/envs/mmrotatedev1toch112/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/home/WuMingrui/miniconda3/envs/mmrotatedev1toch112/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
/home/WuMingrui/miniconda3/envs/mmrotatedev1toch112/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
  warnings.warn(
01/30 13:08:57 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 549113804
    GPU 0,1: NVIDIA A100-SXM4-80GB
    CUDA_HOME: /usr/local/cuda-11.6
    NVCC: Cuda compilation tools, release 11.6, V11.6.124
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
    PyTorch: 1.12.1
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 549113804
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 2

01/30 13:08:57 - mmengine - INFO - Config: angle_version = 'le90' backend_args = None data_root = '../DOTASplit/' dataset_type = 'DOTADataset' default_hooks = dict( checkpoint=dict(interval=1, type='CheckpointHook'), logger=dict(interval=50, type='LoggerHook'), param_scheduler=dict(type='ParamSchedulerHook'), sampler_seed=dict(type='DistSamplerSeedHook'), timer=dict(type='IterTimerHook'), visualization=dict(type='mmdet.DetVisualizationHook')) default_scope = 'mmrotate' env_cfg = dict( cudnn_benchmark=False, dist_cfg=dict(backend='nccl'), mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0)) launcher = 'pytorch' load_from = None log_level = 'INFO' log_processor = dict(by_epoch=True, type='LogProcessor', window_size=50) model = dict( backbone=dict( depth=50, frozen_stages=1, init_cfg=dict(checkpoint='torchvision://resnet50', type='Pretrained'), norm_cfg=dict(requires_grad=True, type='BN'), norm_eval=True, num_stages=4, out_indices=( 0, 1, 2, 3, ), style='pytorch', type='mmdet.ResNet'), data_preprocessor=dict( bgr_to_rgb=True, boxtype2tensor=False, mean=[ 123.675, 116.28, 103.53, ], pad_size_divisor=32, std=[ 58.395, 57.12, 57.375, ], type='mmdet.DetDataPreprocessor'), neck=dict( in_channels=[ 256, 512, 1024, 2048, ], num_outs=5, out_channels=256, type='mmdet.FPN'), roi_head=dict( bbox_head=dict( bbox_coder=dict( angle_version='le90', edge_swap=True, norm_factor=2, target_means=( 0.0, 0.0, 0.0, 0.0, 0.0, ), target_stds=( 0.1, 0.1, 0.2, 0.2, 0.1, ), type='DeltaXYWHTHBBoxCoder'), cls_predictor_cfg=dict(type='mmdet.Linear'), fc_out_channels=1024, in_channels=256, loss_bbox=dict( beta=1.0, loss_weight=1.0, type='mmdet.SmoothL1Loss'), loss_cls=dict( loss_weight=1.0, type='mmdet.CrossEntropyLoss', use_sigmoid=False), num_classes=15, predict_box_type='rbox', reg_class_agnostic=True, reg_predictor_cfg=dict(type='mmdet.Linear'), roi_feat_size=7, type='mmdet.Shared2FCBBoxHead'), bbox_roi_extractor=dict( featmap_strides=[ 4, 8, 16, 32, ], out_channels=256, 
roi_layer=dict(output_size=7, sampling_ratio=0, type='RoIAlign'), type='mmdet.SingleRoIExtractor'), type='mmdet.StandardRoIHead'), rpn_head=dict( anchor_generator=dict( ratios=[ 0.5, 1.0, 2.0, ], scales=[ 8, ], strides=[ 4, 8, 16, 32, 64, ], type='mmdet.AnchorGenerator', use_box_type=True), bbox_coder=dict( target_means=[ 0.0, 0.0, 0.0, 0.0, ], target_stds=[ 1.0, 1.0, 1.0, 1.0, ], type='DeltaXYWHHBBoxCoder', use_box_type=True), feat_channels=256, in_channels=256, loss_bbox=dict( beta=0.1111111111111111, loss_weight=1.0, type='mmdet.SmoothL1Loss'), loss_cls=dict( loss_weight=1.0, type='mmdet.CrossEntropyLoss', use_sigmoid=True), type='mmdet.RPNHead'), test_cfg=dict( rcnn=dict( max_per_img=2000, min_bbox_size=0, nms=dict(iou_threshold=0.1, type='nms_rotated'), nms_pre=2000, score_thr=0.05), rpn=dict( max_per_img=2000, min_bbox_size=0, nms=dict(iou_threshold=0.7, type='nms'), nms_pre=2000)), train_cfg=dict( rcnn=dict( assigner=dict( ignore_iof_thr=-1, iou_calculator=dict(type='RBbox2HBboxOverlaps2D'), match_low_quality=False, min_pos_iou=0.5, neg_iou_thr=0.5, pos_iou_thr=0.5, type='mmdet.MaxIoUAssigner'), debug=False, pos_weight=-1, sampler=dict( add_gt_as_proposals=True, neg_pos_ub=-1, num=512, pos_fraction=0.25, type='mmdet.RandomSampler')), rpn=dict( allowed_border=0, assigner=dict( ignore_iof_thr=-1, iou_calculator=dict(type='RBbox2HBboxOverlaps2D'), match_low_quality=True, min_pos_iou=0.3, neg_iou_thr=0.3, pos_iou_thr=0.7, type='mmdet.MaxIoUAssigner'), debug=False, pos_weight=-1, sampler=dict( add_gt_as_proposals=False, neg_pos_ub=-1, num=256, pos_fraction=0.5, type='mmdet.RandomSampler')), rpn_proposal=dict( max_per_img=2000, min_bbox_size=0, nms=dict(iou_threshold=0.7, type='nms'), nms_pre=2000)), type='mmdet.FasterRCNN') optim_wrapper = dict( clip_grad=dict(max_norm=35, norm_type=2), optimizer=dict(lr=0.02, momentum=0.9, type='SGD', weight_decay=0.0001), type='OptimWrapper') param_scheduler = [ dict( begin=0, by_epoch=False, end=500, 
start_factor=0.3333333333333333, type='LinearLR'), dict( begin=0, by_epoch=True, end=12, gamma=0.1, milestones=[ 8, 11, ], type='MultiStepLR'), ] resume = False test_cfg = dict(type='TestLoop') test_dataloader = dict( batch_size=1, dataset=dict( data_prefix=dict(img_path='val/images/'), data_root='../DOTASplit/', pipeline=[ dict(backend_args=None, type='mmdet.LoadImageFromFile'), dict(keep_ratio=True, scale=( 1024, 1024, ), type='mmdet.Resize'), dict( meta_keys=( 'img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor', ), type='mmdet.PackDetInputs'), ], test_mode=True, type='DOTADataset'), drop_last=False, num_workers=2, persistent_workers=True, sampler=dict(shuffle=False, type='DefaultSampler')) test_evaluator = dict( format_only=True, merge_patches=True, outfile_prefix='./work_dirs/r50_test', type='DOTAMetric') test_pipeline = [ dict(backend_args=None, type='mmdet.LoadImageFromFile'), dict(keep_ratio=True, scale=( 1024, 1024, ), type='mmdet.Resize'), dict( meta_keys=( 'img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor', ), type='mmdet.PackDetInputs'), ] train_cfg = dict(max_epochs=12, type='EpochBasedTrainLoop', val_interval=1) train_dataloader = dict( batch_sampler=None, batch_size=4, dataset=dict( ann_file='train/labelTxt/', data_prefix=dict(img_path='train/images/'), data_root='../DOTASplit/', filter_cfg=dict(filter_empty_gt=True), pipeline=[ dict(backend_args=None, type='mmdet.LoadImageFromFile'), dict( box_type='qbox', type='mmdet.LoadAnnotations', with_bbox=True), dict( box_type_mapping=dict(gt_bboxes='rbox'), type='ConvertBoxType'), dict(keep_ratio=True, scale=( 1024, 1024, ), type='mmdet.Resize'), dict( direction=[ 'horizontal', 'vertical', 'diagonal', ], prob=0.75, type='mmdet.RandomFlip'), dict(type='mmdet.PackDetInputs'), ], type='DOTADataset'), num_workers=4, persistent_workers=True, sampler=dict(shuffle=True, type='DefaultSampler')) train_pipeline = [ dict(backend_args=None, type='mmdet.LoadImageFromFile'), dict(box_type='qbox', 
type='mmdet.LoadAnnotations', with_bbox=True), dict(box_type_mapping=dict(gt_bboxes='rbox'), type='ConvertBoxType'), dict(keep_ratio=True, scale=( 1024, 1024, ), type='mmdet.Resize'), dict( direction=[ 'horizontal', 'vertical', 'diagonal', ], prob=0.75, type='mmdet.RandomFlip'), dict(type='mmdet.PackDetInputs'), ] val_cfg = dict(type='ValLoop') val_dataloader = dict( batch_size=1, dataset=dict( ann_file='val/labelTxt/', data_prefix=dict(img_path='val/images/'), data_root='../DOTASplit/', pipeline=[ dict(backend_args=None, type='mmdet.LoadImageFromFile'), dict(keep_ratio=True, scale=( 1024, 1024, ), type='mmdet.Resize'), dict( box_type='qbox', type='mmdet.LoadAnnotations', with_bbox=True), dict( box_type_mapping=dict(gt_bboxes='rbox'), type='ConvertBoxType'), dict( meta_keys=( 'img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor', ), type='mmdet.PackDetInputs'), ], test_mode=True, type='DOTADataset'), drop_last=False, num_workers=2, persistent_workers=True, sampler=dict(shuffle=False, type='DefaultSampler')) val_evaluator = dict(metric='mAP', type='DOTAMetric') val_pipeline = [ dict(backend_args=None, type='mmdet.LoadImageFromFile'), dict(keep_ratio=True, scale=( 1024, 1024, ), type='mmdet.Resize'), dict(box_type='qbox', type='mmdet.LoadAnnotations', with_bbox=True), dict(box_type_mapping=dict(gt_bboxes='rbox'), type='ConvertBoxType'), dict( meta_keys=( 'img_id', 'img_path', 'ori_shape', 'img_shape', 'scale_factor', ), type='mmdet.PackDetInputs'), ] vis_backends = [ dict(type='LocalVisBackend'), ] visualizer = dict( name='visualizer', type='RotLocalVisualizer', vis_backends=[ dict(type='LocalVisBackend'), ]) work_dir = 'work_dirs/r50_test/'

01/30 13:09:01 - mmengine - INFO - Hooks will be executed in the following order: before_run: (VERY_HIGH ) RuntimeInfoHook (BELOW_NORMAL) LoggerHook

before_train: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (VERY_LOW ) CheckpointHook

before_train_epoch: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (NORMAL ) DistSamplerSeedHook

before_train_iter: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook

after_train_iter: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (BELOW_NORMAL) LoggerHook (LOW ) ParamSchedulerHook (VERY_LOW ) CheckpointHook

after_train_epoch: (NORMAL ) IterTimerHook (LOW ) ParamSchedulerHook (VERY_LOW ) CheckpointHook

before_val: (VERY_HIGH ) RuntimeInfoHook

before_val_epoch: (NORMAL ) IterTimerHook

before_val_iter: (NORMAL ) IterTimerHook

after_val_iter: (NORMAL ) IterTimerHook (NORMAL ) DetVisualizationHook (BELOW_NORMAL) LoggerHook

after_val_epoch: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (BELOW_NORMAL) LoggerHook (LOW ) ParamSchedulerHook (VERY_LOW ) CheckpointHook

after_val: (VERY_HIGH ) RuntimeInfoHook

after_train: (VERY_HIGH ) RuntimeInfoHook (VERY_LOW ) CheckpointHook

before_test: (VERY_HIGH ) RuntimeInfoHook

before_test_epoch: (NORMAL ) IterTimerHook

before_test_iter: (NORMAL ) IterTimerHook

after_test_iter: (NORMAL ) IterTimerHook (NORMAL ) DetVisualizationHook (BELOW_NORMAL) LoggerHook

after_test_epoch: (VERY_HIGH ) RuntimeInfoHook (NORMAL ) IterTimerHook (BELOW_NORMAL) LoggerHook

after_test: (VERY_HIGH ) RuntimeInfoHook

after_run: (BELOW_NORMAL) LoggerHook

01/30 13:14:23 - mmengine - WARNING - Failed to search registry with scope "mmrotate" in the "optim_wrapper" registry tree. As a workaround, the current "optim_wrapper" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmrotate" is a correct scope, or whether the registry is initialized.
01/30 13:16:11 - mmengine - INFO - load model from: torchvision://resnet50
01/30 13:16:11 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50
01/30 13:16:11 - mmengine - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

01/30 13:16:12 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
01/30 13:16:12 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
01/30 13:16:12 - mmengine - INFO - Checkpoints will be saved to /media/Raid/WuMingrui/mmrotate-dev-1.x/work_dirs/r50_test.
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
/media/Raid/WuMingrui/mmrotate-dev-1.x/mmrotate/structures/bbox/rotated_boxes.py:192: UserWarning: The clip function does nothing in RotatedBoxes.
  warnings.warn('The clip function does nothing in RotatedBoxes.')
01/30 13:16:31 - mmengine - INFO - Epoch(train) [1][ 50/1285] lr: 7.9760e-03 eta: 1:38:12 time: 0.3834 data_time: 0.0137 memory: 8161 grad_norm: 4.2167 loss: 1.2448 loss_rpn_cls: 0.4039 loss_rpn_bbox: 0.0778 loss_cls: 0.4934 acc: 99.3652 loss_bbox: 0.2697
01/30 13:16:40 - mmengine - INFO - Epoch(train) [1][ 100/1285] lr: 9.3120e-03 eta: 1:13:16 time: 0.1907 data_time: 0.0071 memory: 6887 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 1.6393 loss_bbox: nan
01/30 13:16:49 - mmengine - INFO - Epoch(train) [1][ 150/1285] lr: 1.0648e-02 eta: 1:03:27 time: 0.1740 data_time: 0.0075 memory: 11067 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 27.2727 loss_bbox: nan
01/30 13:16:58 - mmengine - INFO - Epoch(train) [1][ 200/1285] lr: 1.1984e-02 eta: 0:58:24 time: 0.1731 data_time: 0.0073 memory: 7535 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 12.1212 loss_bbox: nan
01/30 13:17:06 - mmengine - INFO - Epoch(train) [1][ 250/1285] lr: 1.3320e-02 eta: 0:55:23 time: 0.1742 data_time: 0.0068 memory: 7676 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 48.1481 loss_bbox: nan
01/30 13:17:15 - mmengine - INFO - Epoch(train) [1][ 300/1285] lr: 1.4656e-02 eta: 0:53:18 time: 0.1738 data_time: 0.0071 memory: 7579 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 0.0000 loss_bbox: nan
01/30 13:17:24 - mmengine - INFO - Epoch(train) [1][ 350/1285] lr: 1.5992e-02 eta: 0:51:48 time: 0.1750 data_time: 0.0070 memory: 9612 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 43.8202 loss_bbox: nan
01/30 13:17:32 - mmengine - INFO - Epoch(train) [1][ 400/1285] lr: 1.7328e-02 eta: 0:50:39 time: 0.1749 data_time: 0.0071 memory: 8811 grad_norm: nan loss: nan loss_rpn_cls: nan loss_rpn_bbox: nan loss_cls: nan acc: 11.1111 loss_bbox: nan
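
For longer logs, the point where training diverges can be located mechanically. The following is a minimal sketch, not part of MMEngine or MMRotate: the helper name `first_nan_iteration` and the regular expression are assumptions written against the log format shown above. It returns the last logged interval with a finite loss and the first one with a non-finite loss.

```python
import math
import re

# Assumed pattern for the per-interval training lines shown in the log above,
# e.g. "Epoch(train) [1][ 100/1285] lr: 9.3120e-03 ... loss: nan ...".
# "\bloss:" only matches the total loss, not "loss_rpn_cls:" and similar keys.
LINE_RE = re.compile(
    r"Epoch\(train\)\s*\[(\d+)\]\[\s*(\d+)/\d+\]\s+lr:\s*(\S+).*?\bloss:\s*(\S+)"
)


def first_nan_iteration(log_text):
    """Return (last_finite, first_bad) as (epoch, iteration, lr) tuples.

    Either element is None if no such entry exists in the log text.
    """
    last_finite = None
    for match in LINE_RE.finditer(log_text):
        entry = (int(match.group(1)), int(match.group(2)), float(match.group(3)))
        loss = float(match.group(4))  # float("nan") and float("inf") both parse
        if not math.isfinite(loss):
            return last_finite, entry
        last_finite = entry
    return last_finite, None


if __name__ == "__main__":
    # Two entries copied (and shortened) from the log above.
    sample = (
        "Epoch(train) [1][ 50/1285] lr: 7.9760e-03 grad_norm: 4.2167 loss: 1.2448\n"
        "Epoch(train) [1][ 100/1285] lr: 9.3120e-03 grad_norm: nan loss: nan\n"
    )
    print(first_nan_iteration(sample))
```

Because the logged loss is smoothed over a window of 50 iterations (`window_size=50` in the dumped `log_processor`), this only narrows the first NaN to somewhere in iterations 51 to 100 of epoch 1.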

Additional information

I tried mmrotate 0.3.4 on the same dataset and it trained without problems. I then tried rotated RetinaNet on the mmrotate dev-1.x branch, and that also trained fine. I also tried decreasing the learning rate, but the loss still became NaN. I suspect a problem with my environment but cannot pin it down. The environment is: CUDA 11.6, PyTorch 1.12.1, MMEngine 0.10.3, mmcv 2.0.1, mmdet 3.0.0rc6, mmrotate 1.0.0rc1.
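
On the learning rate: the `lr` values in the log can be reproduced from the dumped config (base `lr=0.02`, `LinearLR` with `start_factor=0.3333333333333333`, `begin=0`, `end=500`). The sketch below assumes that the warmup factor is interpolated linearly over `end - begin - 1 = 499` steps and that the value logged at iteration k corresponds to step k - 1. Both are assumptions chosen because they match the logged numbers, not a statement about MMEngine internals, and `warmup_lr` is a hypothetical helper name.

```python
def warmup_lr(step, base_lr=0.02, start_factor=1.0 / 3.0, total_steps=499):
    """Linear warmup: the factor rises from start_factor to 1.0 over total_steps.

    total_steps=499 is an assumption (end - begin - 1) that reproduces the
    lr values printed in the training log.
    """
    progress = min(step, total_steps) / total_steps
    factor = start_factor + (1.0 - start_factor) * progress
    return base_lr * factor


if __name__ == "__main__":
    # Compare against the lr column of the log at iterations 50, 100, ..., 400.
    for logged_iter in (50, 100, 150, 200, 250, 400):
        print(logged_iter, f"{warmup_lr(logged_iter - 1):.4e}")
```

Under these assumptions the loss first became NaN while the warmup learning rate was still below 0.01, less than half of the configured 0.02.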

kristupas-g commented 7 months ago

I appear to be having the same issue: I tried 0.3.4 and it worked fine, but dev-1.x did not work.