[Feature] Training hangs indefinitely after finishing one epoch. How can this be resolved?

2276924877 commented 1 year ago

What's the feature?

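I selected 10 original images as the training set and 6 original images as the validation set. After cropping, there are 162 train images and 219 val images. Once the first epoch finishes, training hangs and makes no further progress. Environment: mmrotate 0.3.4, torch 1.9.1, CUDA 11.1, training GPU RTX 3090 24G.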

Any other context?

No response

shaunyuan22 commented 1 year ago

It seems that you are running evaluation after every epoch. Please evaluate only once training has completed, by adding the following to your config: evaluation = dict(interval=12, metric='mAP')
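As a rough sketch of that suggestion, assuming an mmcv 1.x style Python config and a 12-epoch schedule: tying the evaluation interval to the epoch count means evaluation runs only once, after the last epoch. The `max_epochs` variable is introduced here for illustration and is not part of the original config.

```python
# Sketch of an mmcv-style config fragment (assumes a 12-epoch schedule).
# `max_epochs` is a helper variable added here for illustration only.
max_epochs = 12

runner = dict(type='EpochBasedRunner', max_epochs=max_epochs)

# Evaluate every `max_epochs` epochs, i.e. only once, after the final epoch.
evaluation = dict(interval=max_epochs, metric='mAP')
```

If the schedule length changes, the interval follows it automatically, so evaluation is not silently skipped or triggered mid-training.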

chnu-cpl commented 1 year ago

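Has the original poster solved this problem? I also have a question for the author: I set evaluation = dict(interval=12, metric='mAP'), so evaluation runs after training completes, but it hangs indefinitely, both during validation and during the final test. I ran it on the whole dataset. The first screenshot was taken during validation, the second during testing.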

shaunyuan22 commented 1 year ago

Please provide more information, such as the model config and hardware platform information.
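One generic way to gather more information about where a Python process is stuck is the standard-library `faulthandler` module, which can periodically dump the tracebacks of all threads. The sketch below is not part of mmrotate; `enable_hang_dump` and `disable_hang_dump` are hypothetical helper names, and the 600-second timeout is an arbitrary assumption.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=600, stream=None):
    """Dump all thread tracebacks every `timeout_s` seconds until cancelled.

    Hypothetical helper: call it near the top of the training/test script.
    `stream` must be a real file object (it needs a file descriptor).
    """
    if stream is None:
        stream = sys.stderr
    # Also dump tracebacks on fatal signals such as SIGSEGV.
    faulthandler.enable(file=stream)
    # Repeatedly dump tracebacks so a hang shows which call is blocking.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_dump():
    """Stop the periodic traceback dumps."""
    faulthandler.cancel_dump_traceback_later()
```

If the evaluation stalls, the dumped traceback shows which function each thread is blocked in, which is useful detail to attach to a report like this one.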

chnu-cpl commented 1 year ago

> Please provide more information, such as the model config and hardware platform information.

sys.platform: linux
Python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 11.5, V11.5.119
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 1.10.0+cu113
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.3
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.2
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1+cu113
OpenCV: 4.7.0
MMCV: 1.6.1
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMRotate: 0.3.2+

2023-09-02 15:30:51,400 - mmrotate - INFO - Distributed training: False
2023-09-02 15:30:51,472 - mmrotate - INFO - Config:
dataset_type = 'SODAADataset'
data_root = '/home/cpl/dataset/SODA-A/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='RResize', img_scale=(1200, 1200)),
    dict(
        type='RRandomFlip',
        flip_ratio=[0.25, 0.25, 0.25],
        direction=['horizontal', 'vertical', 'diagonal'],
        version='le135'),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1200, 1200),
        flip=False,
        transforms=[
            dict(type='RResize'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='SODAADataset',
        ann_file='/home/cpl/dataset/SODA-A/train/Annotations/',
        img_prefix='/home/cpl/dataset/SODA-A/train/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations', with_bbox=True),
            dict(type='RResize', img_scale=(1200, 1200)),
            dict(
                type='RRandomFlip',
                flip_ratio=[0.25, 0.25, 0.25],
                direction=['horizontal', 'vertical', 'diagonal'],
                version='le135'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
        ],
        ori_ann_file='/home/cpl/dataset/SODA-A/Annotations/train/',
        version='le135'),
    val=dict(
        type='SODAADataset',
        ann_file='/home/cpl/dataset/SODA-A/val/Annotations/',
        img_prefix='/home/cpl/dataset/SODA-A/val/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1200, 1200),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        ori_ann_file='/home/cpl/dataset/SODA-A/Annotations/val/',
        version='le135'),
    test=dict(
        type='SODAADataset',
        ann_file='/home/cpl/dataset/SODA-A/test/Annotations/',
        img_prefix='/home/cpl/dataset/SODA-A/test/Images/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1200, 1200),
                flip=False,
                transforms=[
                    dict(type='RResize'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='DefaultFormatBundle'),
                    dict(type='Collect', keys=['img'])
                ])
        ],
        ori_ann_file='/home/cpl/dataset/SODA-A/Annotations/test/',
        version='le135'))
evaluation = dict(interval=12, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.3333333333333333,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=12)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
angle_version = 'le135'

shaunyuan22 commented 1 year ago

The environment configuration looks fine. Could you provide the model config? Also, does this issue occur during testing for all models, or only for certain models?

chnu-cpl commented 1 year ago

The model config

The model config is configs/sodaa-benchmarks/rotated_retinanet_obb_r50_fpn_1x.py. I haven't trained any other models yet, so I have only hit the hang when testing with this config. In case it is a one-off, I will follow up after training more models.

shaunyuan22 commented 1 year ago

Alright, the data and code issues can be ruled out. In the future, we will update the evaluation code to improve the execution speed and robustness, which may address the problem you encountered.

wenzx18 commented 4 months ago

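> What's the feature?
>
> I selected 10 original images as the training set and 6 original images as the validation set. After cropping, there are 162 train images and 219 val images. Once the first epoch finishes, training hangs and makes no further progress. Environment: mmrotate 0.3.4, torch 1.9.1, CUDA 11.1, training GPU RTX 3090 24G.
>
> Any other context?
>
> No response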

You can try to set nproc=0 at line 412 in sodaa.py:

merged_results = self.merge_det(results, nproc=0)

With nproc=0 it does not use the multiprocessing module, and that works for me.
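The general pattern behind this workaround can be sketched as below. This is not the code in sodaa.py: `merge_one` and `merge_all` are hypothetical names, the per-image work is a stand-in, and the assumption (not confirmed in this thread) is that `merge_det` falls back to a plain in-process loop when `nproc` is 0 or 1.

```python
from multiprocessing import Pool


def merge_one(item):
    """Hypothetical per-image merge step: sort detections by score, descending."""
    img_id, dets = item
    return img_id, sorted(dets, key=lambda d: d[-1], reverse=True)


def merge_all(items, nproc=4):
    """Run `merge_one` over all items, serially or in a worker pool.

    With nproc <= 1 no worker processes are created, so a hang inside the
    multiprocessing machinery cannot occur on this path.
    """
    if nproc <= 1:
        return [merge_one(item) for item in items]
    with Pool(nproc) as pool:
        return pool.map(merge_one, items)
```

The serial path is slower on large result sets, but it removes process spawning and inter-process communication from the picture, which also helps tell a multiprocessing hang apart from a slow evaluation.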

shaunyuan22 / SODA-mmrotate