Training SDMGR: loss is NaN

I get NaN losses when training the SDMGR model. The command and its output:

ssh://root@172.16.135.60:2244/opt/conda/bin/python -u /mmocr/tools/train.py configs/kie/sdmgr/sdmgr_unet16_60e_subtitile_classify.py
2021-11-16 04:41:01,896 - mmocr - INFO - Environment info:

sys.platform: linux
Python: 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce RTX 2070
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.3
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.5.2
MMCV: 1.3.4
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMOCR: 0.2.0+ae90dea

2021-11-16 04:41:05,357 - mmocr - INFO - Distributed training: False
2021-11-16 04:41:08,726 - mmocr - INFO - Config:
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
max_scale = 1024
min_scale = 512
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.0),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='KIEFormatBundle'),
    dict(
        type='Collect',
        keys=['img', 'relations', 'texts', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.0),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='KIEFormatBundle'),
    dict(type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes'])
]
dataset_type = 'KIEDataset'
data_root = '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train'
loader = dict(
    type='HardDiskLoader',
    repeat=1,
    parser=dict(
        type='LineJsonParser',
        keys=['file_name', 'height', 'width', 'annotations']))
train = dict(
    type='KIEDataset',
    ann_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/train.txt',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
        dict(type='RandomFlip', flip_ratio=0.0),
        dict(
            type='Normalize',
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            to_rgb=True),
        dict(type='Pad', size_divisor=32),
        dict(type='KIEFormatBundle'),
        dict(
            type='Collect',
            keys=['img', 'relations', 'texts', 'gt_bboxes', 'gt_labels'])
    ],
    img_prefix='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train',
    loader=dict(
        type='HardDiskLoader',
        repeat=1,
        parser=dict(
            type='LineJsonParser',
            keys=['file_name', 'height', 'width', 'annotations'])),
    dict_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt',
    test_mode=False)
test = dict(
    type='KIEDataset',
    ann_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/test.txt',
    pipeline=[
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations'),
        dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
        dict(type='RandomFlip', flip_ratio=0.0),
        dict(
            type='Normalize',
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            to_rgb=True),
        dict(type='Pad', size_divisor=32),
        dict(type='KIEFormatBundle'),
        dict(type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes'])
    ],
    img_prefix='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train',
    loader=dict(
        type='HardDiskLoader',
        repeat=1,
        parser=dict(
            type='LineJsonParser',
            keys=['file_name', 'height', 'width', 'annotations'])),
    dict_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt',
    test_mode=True)
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=1,
    train=dict(
        type='KIEDataset',
        ann_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/train.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.0),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='KIEFormatBundle'),
            dict(
                type='Collect',
                keys=['img', 'relations', 'texts', 'gt_bboxes', 'gt_labels'])
        ],
        img_prefix='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train',
        loader=dict(
            type='HardDiskLoader',
            repeat=1,
            parser=dict(
                type='LineJsonParser',
                keys=['file_name', 'height', 'width', 'annotations'])),
        dict_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt',
        test_mode=False),
    val=dict(
        type='KIEDataset',
        ann_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/test.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.0),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='KIEFormatBundle'),
            dict(
                type='Collect',
                keys=['img', 'relations', 'texts', 'gt_bboxes'])
        ],
        img_prefix='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train',
        loader=dict(
            type='HardDiskLoader',
            repeat=1,
            parser=dict(
                type='LineJsonParser',
                keys=['file_name', 'height', 'width', 'annotations'])),
        dict_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt',
        test_mode=True),
    test=dict(
        type='KIEDataset',
        ann_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/test.txt',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
            dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
            dict(type='RandomFlip', flip_ratio=0.0),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='KIEFormatBundle'),
            dict(
                type='Collect',
                keys=['img', 'relations', 'texts', 'gt_bboxes'])
        ],
        img_prefix='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train',
        loader=dict(
            type='HardDiskLoader',
            repeat=1,
            parser=dict(
                type='LineJsonParser',
                keys=['file_name', 'height', 'width', 'annotations'])),
        dict_file='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt',
        test_mode=True))
evaluation = dict(
    interval=1,
    metric='macro_f1',
    metric_options=dict(
        macro_f1=dict(
            ignores=[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 25])))
model = dict(
    type='SDMGR',
    backbone=dict(type='UNet', base_channels=16),
    bbox_head=dict(
        type='SDMGRHead', visual_dim=16, num_chars=92, num_classes=26),
    visual_modality=True,
    train_cfg=None,
    test_cfg=None,
    class_list='/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/class_list.txt')
optimizer = dict(type='Adam', weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=1,
    warmup_ratio=1,
    step=[40, 50])
total_epochs = 60
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
find_unused_parameters = True
work_dir = './work_dirs/sdmgr_unet16_60e_subtitile_classify'
gpu_ids = range(0, 1)

/mmocr/mmocr/apis/train.py:79: UserWarning: config is now expected to have a runner section, please set runner in your config.
  'please set runner in your config.', UserWarning)
2021-11-16 04:41:11,105 - mmocr - INFO - Start running, host: root@mrli-HP-Z440, work_dir: /mmocr/work_dirs/sdmgr_unet16_60e_subtitile_classify
2021-11-16 04:41:11,105 - mmocr - INFO - workflow: [('train', 1)], max: 60 epochs
2021-11-16 04:41:19,250 - mmocr - INFO - Epoch [1][50/59214] lr: 1.000e-03, eta: 6 days, 16:43:32, time: 0.163, data_time: 0.048, memory: 962, loss_node: nan, loss_edge: nan, acc_node: 11.0414, acc_edge: 4.0000, loss: nan
2021-11-16 04:41:25,053 - mmocr - INFO - Epoch [1][100/59214] lr: 1.000e-03, eta: 5 days, 17:37:30, time: 0.116, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 10.4146, acc_edge: 0.0000, loss: nan
2021-11-16 04:41:30,881 - mmocr - INFO - Epoch [1][150/59214] lr: 1.000e-03, eta: 5 days, 10:05:07, time: 0.117, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 11.1532, acc_edge: 0.0000, loss: nan
2021-11-16 04:41:36,720 - mmocr - INFO - Epoch [1][200/59214] lr: 1.000e-03, eta: 5 days, 6:22:24, time: 0.117, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 10.5853, acc_edge: 0.0000, loss: nan
2021-11-16 04:41:42,385 - mmocr - INFO - Epoch [1][250/59214] lr: 1.000e-03, eta: 5 days, 3:27:33, time: 0.113, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 7.8421, acc_edge: 0.0000, loss: nan
2021-11-16 04:41:48,179 - mmocr - INFO - Epoch [1][300/59214] lr: 1.000e-03, eta: 5 days, 1:56:09, time: 0.116, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 8.9169, acc_edge: 0.0000, loss: nan
2021-11-16 04:41:53,983 - mmocr - INFO - Epoch [1][350/59214] lr: 1.000e-03, eta: 5 days, 0:52:50, time: 0.116, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 7.6211, acc_edge: 0.0000, loss: nan
2021-11-16 04:41:59,853 - mmocr - INFO - Epoch [1][400/59214] lr: 1.000e-03, eta: 5 days, 0:14:48, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 8.9602, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:05,717 - mmocr - INFO - Epoch [1][450/59214] lr: 1.000e-03, eta: 4 days, 23:44:38, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.1780, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:11,605 - mmocr - INFO - Epoch [1][500/59214] lr: 1.000e-03, eta: 4 days, 23:23:11, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 11.6985, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:17,470 - mmocr - INFO - Epoch [1][550/59214] lr: 1.000e-03, eta: 4 days, 23:03:16, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 9.7816, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:23,296 - mmocr - INFO - Epoch [1][600/59214] lr: 1.000e-03, eta: 4 days, 22:42:42, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 11.2980, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:29,190 - mmocr - INFO - Epoch [1][650/59214] lr: 1.000e-03, eta: 4 days, 22:31:31, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 12.2074, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:34,988 - mmocr - INFO - Epoch [1][700/59214] lr: 1.000e-03, eta: 4 days, 22:13:48, time: 0.116, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.6120, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:40,757 - mmocr - INFO - Epoch [1][750/59214] lr: 1.000e-03, eta: 4 days, 21:56:10, time: 0.115, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 7.0793, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:46,679 - mmocr - INFO - Epoch [1][800/59214] lr: 1.000e-03, eta: 4 days, 21:52:01, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.7206, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:52,534 - mmocr - INFO - Epoch [1][850/59214] lr: 1.000e-03, eta: 4 days, 21:43:40, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 13.3489, acc_edge: 0.0000, loss: nan
2021-11-16 04:42:58,445 - mmocr - INFO - Epoch [1][900/59214] lr: 1.000e-03, eta: 4 days, 21:39:52, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.2977, acc_edge: 0.0000, loss: nan
2021-11-16 04:43:04,358 - mmocr - INFO - Epoch [1][950/59214] lr: 1.000e-03, eta: 4 days, 21:36:40, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 8.5367, acc_edge: 0.0000, loss: nan
2021-11-16 04:43:10,178 - mmocr - INFO - Exp name: sdmgr_unet16_60e_subtitile_classify.py
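
In the log above the loss is already NaN at the first logged interval (iteration 50 of epoch 1). For a longer log, a small script can report the first logged iteration at which the total loss is non-finite. This is only a sketch: it assumes the text-logger line format shown above, and `first_non_finite` is a hypothetical helper name, not part of MMOCR or MMCV.

```python
import math
import re

# Matches one training log entry and captures the total loss.
# "\bloss: " skips loss_node/loss_edge because they are followed by "_".
LOG_PATTERN = re.compile(
    r"Epoch \[(?P<epoch>\d+)\]\[(?P<iter>\d+)/(?P<total>\d+)\].*?\bloss: (?P<loss>\S+)"
)


def first_non_finite(log_text):
    """Return (epoch, iteration) of the first NaN/inf total loss, or None."""
    for match in LOG_PATTERN.finditer(log_text):
        loss = float(match.group("loss"))  # float("nan") and float("inf") parse fine
        if not math.isfinite(loss):
            return int(match.group("epoch")), int(match.group("iter"))
    return None


sample = (
    "2021-11-16 04:41:19,250 - mmocr - INFO - Epoch [1][50/59214] "
    "lr: 1.000e-03, eta: 6 days, 16:43:32, time: 0.163, data_time: 0.048, "
    "memory: 962, loss_node: nan, loss_edge: nan, acc_node: 11.0414, "
    "acc_edge: 4.0000, loss: nan"
)
print(first_non_finite(sample))  # prints (1, 50)
```

Because the logger only reports every 50 iterations (`log_config = dict(interval=50, ...)`), this narrows the failure to a window rather than a single batch.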

A sample line from the training annotation file:
{"file_name": "0001.江苏网络电视台-晚间新闻 20211025_500.jpg", "height": 720, "width": 1280, "annotations": [{"box": [116.0, 601.0, 116.0, 644.0, 192.0, 644.0, 192.0, 601.0], "label": 6, "text": "新闻"}, {"box": [117.0, 566.0, 117.0, 602.0, 190.0, 602.0, 190.0, 566.0], "label": 6, "text": "晚间"}, {"box": [911.0, 456.0, 911.0, 487.0, 1004.0, 487.0, 1004.0, 456.0], "label": 1, "text": "王柏文"}, {"box": [1049.0, 456.0, 1051.0, 489.0, 1123.0, 486.0, 1122.0, 454.0], "label": 1, "text": "主播"}, {"box": [1035.0, 72.0, 1035.0, 95.0, 1197.0, 97.0, 1197.0, 74.0], "label": 6, "text": "JsTVcomp"}, {"box": [197.0, 62.0, 197.0, 92.0, 311.0, 90.0, 311.0, 59.0], "label": 6, "text": "江苏卫视"}, {"box": [1036.0, 31.0, 1036.0, 68.0, 1183.0, 68.0, 1183.0, 31.0], "label": 6, "text": "凉枝叹"}]}
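
Since the NaN appears from the very first logged iterations, it may be worth ruling out malformed annotations before looking at the model. The sketch below checks one line of the line-JSON annotation file for a few basic problems: a box that does not have 8 values, a box with zero width or height, a box outside the image, a label outside `[0, num_classes)`, or empty text. The default `num_classes=26` comes from the config above; `check_annotation_line` is a hypothetical helper, not an MMOCR function, and passing these checks does not prove the data is what SDMGR expects.

```python
import json


def check_annotation_line(line, num_classes=26):
    """Return a list of human-readable problems found in one annotation line."""
    record = json.loads(line)
    width, height = record["width"], record["height"]
    problems = []
    for i, ann in enumerate(record["annotations"]):
        box = ann.get("box", [])
        if len(box) != 8:
            problems.append(f"annotation {i}: expected 8 box values, got {len(box)}")
            continue
        xs, ys = box[0::2], box[1::2]  # x and y coordinates of the 4 corners
        if max(xs) - min(xs) <= 0 or max(ys) - min(ys) <= 0:
            problems.append(f"annotation {i}: box has zero width or height")
        if min(xs) < 0 or min(ys) < 0 or max(xs) > width or max(ys) > height:
            problems.append(f"annotation {i}: box lies outside the image")
        label = ann.get("label")
        if not isinstance(label, int) or not 0 <= label < num_classes:
            problems.append(f"annotation {i}: label {label!r} outside [0, {num_classes})")
        if not ann.get("text"):
            problems.append(f"annotation {i}: empty text")
    return problems


# First annotation of the sample line above, shortened for illustration.
sample_line = (
    '{"file_name": "example.jpg", "height": 720, "width": 1280, "annotations": '
    '[{"box": [116.0, 601.0, 116.0, 644.0, 192.0, 644.0, 192.0, 601.0], '
    '"label": 6, "text": "新闻"}]}'
)
print(check_annotation_line(sample_line))  # prints []
```

To scan a whole file, call the function on each line of `train.txt` and print the line number together with any reported problems.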

open-mmlab / mmcv

train sdmgr loss Nan #1494
