shaunyuan22 / SODA-mmrotate

SODA-A Small Object Detection Toolbox and Benchmark
https://shaunyuan22.github.io/SODA/
Apache License 2.0

[Feature] Training hangs indefinitely after finishing one epoch; how can this be resolved? #8

Closed 2276924877 closed 1 year ago

2276924877 commented 1 year ago

What's the feature?

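I selected 10 original images as the training set and 6 original images as the validation set. After cropping, there are 162 train images and 219 val images. Training hangs after the first epoch finishes. Environment: mmrotate 0.3.4, torch 1.9.1, CUDA 11.1, training GPU RTX 3090 24G.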

image

Any other context?

No response

shaunyuan22 commented 1 year ago

It seems that you are running evaluation after every epoch. Please evaluate only once training has completed by adding the following to your config: evaluation = dict(interval=12, metric='mAP')
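As a minimal sketch of the suggestion above, the fragment below ties the evaluation interval to the training length. It assumes a 12-epoch schedule with `EpochBasedRunner`, as in the config posted later in this thread; adjust `max_epochs` if your schedule differs.

```python
# Assumed schedule: 12 epochs, matching the config posted later in this thread.
runner = dict(type='EpochBasedRunner', max_epochs=12)

# Evaluate once every `interval` epochs. With interval equal to max_epochs,
# evaluation should run only once, after the final epoch.
evaluation = dict(interval=runner['max_epochs'], metric='mAP')
```

Deriving `interval` from `max_epochs` keeps the two values in sync if the schedule is changed later.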

chnu-cpl commented 1 year ago

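Has the original poster solved this problem? A question for the author: I set evaluation = dict(interval=12, metric='mAP'), so evaluation runs after training completes, but it hangs indefinitely, both during validation and during the final test. I ran it on the whole dataset. The first image shows validation and the second image shows testing.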

shaunyuan22 commented 1 year ago

Please provide more information, such as the model config and your hardware platform.

chnu-cpl commented 1 year ago

please provide more information such as model config and hardware platform information.

sys.platform: linux
Python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 11.5, V11.5.119
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:

TorchVision: 0.11.1+cu113
OpenCV: 4.7.0
MMCV: 1.6.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMRotate: 0.3.2+

2023-09-02 15:30:51,400 - mmrotate - INFO - Distributed training: False
2023-09-02 15:30:51,472 - mmrotate - INFO - Config:
dataset_type = 'SODAADataset'
data_root = '/home/cpl/dataset/SODA-A/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1200, 1200)),
    dict(
        type='RRandomFlip',
        flip_ratio=[0.25, 0.25, 0.25],
        direction=['horizontal', 'vertical', 'diagonal'],
        version='le135'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1200, 1200),
        flip=False,
        transforms=[
            dict(type='RResize'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='SODAADataset',
        ann_file='/home/cpl/dataset/SODA-A/train/Annotations/',
        img_prefix='/home/cpl/dataset/SODA-A/train/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RResize', img_scale=(1200, 1200)),
            dict(
                type='RRandomFlip',
                flip_ratio=[0.25, 0.25, 0.25],
                direction=['horizontal', 'vertical', 'diagonal'],
                version='le135'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ],
        ori_ann_file='/home/cpl/dataset/SODA-A/Annotations/train/',
        version='le135'),
    val=dict(
        type='SODAADataset',
        ann_file='/home/cpl/dataset/SODA-A/val/Annotations/',
        img_prefix='/home/cpl/dataset/SODA-A/val/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1200, 1200),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        ori_ann_file='/home/cpl/dataset/SODA-A/Annotations/val/',
        version='le135'),
    test=dict(
        type='SODAADataset',
        ann_file='/home/cpl/dataset/SODA-A/test/Annotations/',
        img_prefix='/home/cpl/dataset/SODA-A/test/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1200, 1200),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        ori_ann_file='/home/cpl/dataset/SODA-A/Annotations/test/',
        version='le135'))
evaluation = dict(interval=12, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=12)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
angle_version = 'le135'

shaunyuan22 commented 1 year ago

The environment configuration looks fine. Could you provide the model config? Also, does this issue occur when testing every model, or only with certain models?

chnu-cpl commented 1 year ago

The model config

The model config is configs/sodaa-benchmarks/rotated_retinanet_obb_r50_fpn_1x.py. I haven't trained any other models yet, so I have only seen the hang when testing with this config. If it turns out to be a one-off, I will follow up after training more models.

shaunyuan22 commented 1 year ago

Alright, the data and code issues can be ruled out. In the future, we will update the evaluation code to improve the execution speed and robustness, which may address the problem you encountered.

wenzx18 commented 4 months ago

What's the feature?

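I selected 10 original images as the training set and 6 original images as the validation set. After cropping, there are 162 train images and 219 val images. Training hangs after the first epoch finishes. Environment: mmrotate 0.3.4, torch 1.9.1, CUDA 11.1, training GPU RTX 3090 24G.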

image

Any other context?

No response

You can try setting nproc=0 at line 412 of sodaa.py:

merged_results = self.merge_det(results, nproc=0)

With this setting the multiprocessing module is not used, and it works for me.
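To illustrate why a setting like nproc=0 can avoid a hang, here is a rough sketch of the general pattern: a merge routine that spawns worker processes only when more than one is requested. This is not the actual code in sodaa.py; `merge_one` and `merge_all` are hypothetical names, and the "merge" here is a stand-in that just keeps the highest score per image.

```python
from multiprocessing import Pool


def merge_one(patch_scores):
    """Hypothetical per-image merge step: keep the highest patch score."""
    return max(patch_scores)


def merge_all(per_image_scores, nproc=4):
    """Merge every image's patch results, serially when nproc <= 1."""
    if nproc <= 1:
        # Serial path: no worker processes are created, so no
        # inter-process communication is involved.
        return [merge_one(scores) for scores in per_image_scores]
    # Parallel path: distribute the per-image work over worker processes.
    with Pool(nproc) as pool:
        return pool.map(merge_one, per_image_scores)


if __name__ == "__main__":
    # nproc=0 takes the serial path.
    print(merge_all([[0.2, 0.9], [0.5, 0.1]], nproc=0))
```

The serial path is slower on large datasets, but it is a useful way to check whether a hang comes from the worker processes rather than from the merge logic itself.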