You have used Chinese in your data, but num_chars in SDMGRHead was still 92. You need to change num_chars to the actual number of characters in your dict.
Thank you for your help. I have changed num_chars to 5111 (the actual number of characters in my dict) and num_classes to 7 (the number of categories), but a new problem appears during evaluation:
2021-11-16 11:44:41,993 - mmocr - INFO - workflow: [('train', 1)], max: 2 epochs
2021-11-16 11:44:49,954 - mmocr - INFO - Epoch [1][50/361] lr: 1.000e-03, eta: 0:01:46, time: 0.159, data_time: 0.045, memory: 997, loss_node: 0.8057, loss_edge: 0.0423, acc_node: 76.2714, acc_edge: 97.9286, loss: 0.8480
2021-11-16 11:44:55,909 - mmocr - INFO - Epoch [1][100/361] lr: 1.000e-03, eta: 0:01:26, time: 0.119, data_time: 0.003, memory: 997, loss_node: 0.5725, loss_edge: 0.0009, acc_node: 80.6275, acc_edge: 100.0000, loss: 0.5734
2021-11-16 11:45:01,839 - mmocr - INFO - Epoch [1][150/361] lr: 1.000e-03, eta: 0:01:15, time: 0.119, data_time: 0.003, memory: 997, loss_node: 0.4464, loss_edge: 0.0019, acc_node: 84.7223, acc_edge: 100.0000, loss: 0.4482
2021-11-16 11:45:07,592 - mmocr - INFO - Epoch [1][200/361] lr: 1.000e-03, eta: 0:01:06, time: 0.115, data_time: 0.003, memory: 997, loss_node: 0.4829, loss_edge: 0.0008, acc_node: 85.5441, acc_edge: 100.0000, loss: 0.4837
2021-11-16 11:45:13,536 - mmocr - INFO - Epoch [1][250/361] lr: 1.000e-03, eta: 0:00:59, time: 0.119, data_time: 0.004, memory: 997, loss_node: 0.4424, loss_edge: 0.0012, acc_node: 87.4994, acc_edge: 100.0000, loss: 0.4436
2021-11-16 11:45:19,420 - mmocr - INFO - Epoch [1][300/361] lr: 1.000e-03, eta: 0:00:52, time: 0.118, data_time: 0.004, memory: 997, loss_node: 0.3853, loss_edge: 0.0007, acc_node: 87.0350, acc_edge: 100.0000, loss: 0.3860
2021-11-16 11:45:25,337 - mmocr - INFO - Epoch [1][350/361] lr: 1.000e-03, eta: 0:00:46, time: 0.118, data_time: 0.004, memory: 997, loss_node: 0.3552, loss_edge: 0.0007, acc_node: 87.0647, acc_edge: 100.0000, loss: 0.3559
[>>>>>>>>>>>>>>>>>>>>>>>>>>>] 1705/1705, 32.6 task/s, elapsed: 52s, ETA: 0s
Traceback (most recent call last):
File "/mmocr/tools/train.py", line 212, in
num_chars should be dict_size + 1, since blank is added in https://github.com/open-mmlab/mmocr/blob/main/mmocr/datasets/kie_dataset.py#L59. Please change num_chars to 5112.
If the problem is still there, please attach a piece of your annotation file (like 100 lines) and dictionary file here, and we will use them to locate the problem.
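The dict_size + 1 rule above can be checked with a short script before editing the config. This is only a minimal sketch: it assumes the dictionary file holds one character per line, and count_num_chars and the throwaway demo file are hypothetical names for illustration, not part of MMOCR.

```python
import os
import tempfile


def count_num_chars(dict_path):
    """Return dict_size + 1, reserving one extra index for the blank entry."""
    with open(dict_path, encoding="utf-8") as f:
        # Assumption: one character per line; strip only the line endings.
        chars = [line.rstrip("\r\n") for line in f]
    # Ignore empty lines so a trailing newline does not inflate the count.
    chars = [c for c in chars if c != ""]
    return len(chars) + 1


# Demo with a throwaway three-character dictionary.
with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("新\n闻\n晚\n")
demo_num_chars = count_num_chars(tmp.name)
os.remove(tmp.name)
print(demo_num_chars)  # 4
```

Run against the real dict.txt, a dictionary of 5111 characters would give 5112, which is the value suggested above for num_chars.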
I get an error when training the SDMGR model:
ssh://root@172.16.135.60:2244/opt/conda/bin/python -u /mmocr/tools/train.py configs/kie/sdmgr/sdmgr_unet16_60e_subtitile_classify.py
2021-11-16 04:41:01,896 - mmocr - INFO - Environment info:
sys.platform: linux
Python: 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce RTX 2070
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:
GCC 7.3 C++ Version: 201402 Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc) OpenMP 201511 (a.k.a. OpenMP 4.5) NNPACK is enabled CPU capability usage: AVX2 CUDA Runtime 10.1 NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37 CuDNN 7.6.3 Magma 2.5.2 Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, TorchVision: 0.6.0a0+82fd1c8 OpenCV: 4.5.2 MMCV: 1.3.4 MMCV Compiler: GCC 7.3 MMCV CUDA Compiler: 10.1 MMOCR: 0.2.0+ae90dea 2021-11-16 04:41:05,357 - mmocr - INFO - Distributed training: False 2021-11-16 04:41:08,726 - mmocr - INFO - Config: img_norm_cfg = dict( mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True) max_scale = 
1024 min_scale = 512 train_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict( type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes', 'gt_labels']) ] test_pipeline = [ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict(type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes']) ] dataset_type = 'KIEDataset' data_root = '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train' loader = dict( type='HardDiskLoader', repeat=1, parser=dict( type='LineJsonParser', keys=['file_name', 'height', 'width', 'annotations'])) train = dict( type='KIEDataset', ann_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/train.txt', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict( type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes', 'gt_labels']) ], img_prefix= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train', loader=dict( type='HardDiskLoader', repeat=1, parser=dict( type='LineJsonParser', keys=['file_name', 'height', 'width', 'annotations'])), dict_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt', 
test_mode=False) test = dict( type='KIEDataset', ann_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/test.txt', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict(type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes']) ], img_prefix= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train', loader=dict( type='HardDiskLoader', repeat=1, parser=dict( type='LineJsonParser', keys=['file_name', 'height', 'width', 'annotations'])), dict_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt', test_mode=True) data = dict( samples_per_gpu=1, workers_per_gpu=1, train=dict( type='KIEDataset', ann_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/train.txt', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict( type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes', 'gt_labels']) ], img_prefix= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train', loader=dict( type='HardDiskLoader', repeat=1, parser=dict( type='LineJsonParser', keys=['file_name', 'height', 'width', 'annotations'])), dict_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt', test_mode=False), val=dict( type='KIEDataset', ann_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/test.txt', pipeline=[ dict(type='LoadImageFromFile'), 
dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict( type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes']) ], img_prefix= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train', loader=dict( type='HardDiskLoader', repeat=1, parser=dict( type='LineJsonParser', keys=['file_name', 'height', 'width', 'annotations'])), dict_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt', test_mode=True), test=dict( type='KIEDataset', ann_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/test.txt', pipeline=[ dict(type='LoadImageFromFile'), dict(type='LoadAnnotations'), dict(type='Resize', img_scale=(1024, 512), keep_ratio=True), dict(type='RandomFlip', flip_ratio=0.0), dict( type='Normalize', mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True), dict(type='Pad', size_divisor=32), dict(type='KIEFormatBundle'), dict( type='Collect', keys=['img', 'relations', 'texts', 'gt_bboxes']) ], img_prefix= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train', loader=dict( type='HardDiskLoader', repeat=1, parser=dict( type='LineJsonParser', keys=['file_name', 'height', 'width', 'annotations'])), dict_file= '/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/dict.txt', test_mode=True)) evaluation = dict( interval=1, metric='macro_f1', metric_options=dict( macro_f1=dict( ignores=[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 25]))) model = dict( type='SDMGR', backbone=dict(type='UNet', base_channels=16), bbox_head=dict( type='SDMGRHead', visual_dim=16, num_chars=92, num_classes=26), visual_modality=True, train_cfg=None, test_cfg=None, class_list= 
'/data/labels_convert/KIE_subtitle_classify/video_frame_dataset_train/class_list.txt' ) optimizer = dict(type='Adam', weight_decay=0.0001) optimizer_config = dict(grad_clip=None) lr_config = dict( policy='step', warmup='linear', warmup_iters=1, warmup_ratio=1, step=[40, 50]) total_epochs = 60 checkpoint_config = dict(interval=1) log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')]) dist_params = dict(backend='nccl') log_level = 'INFO' load_from = None resume_from = None workflow = [('train', 1)] find_unused_parameters = True work_dir = './work_dirs/sdmgr_unet16_60e_subtitile_classify' gpu_ids = range(0, 1)
/mmocr/mmocr/apis/train.py:79: UserWarning: config is now expected to have a runner section, please set runner in your config. 'please set runner in your config.', UserWarning) 2021-11-16 04:41:11,105 - mmocr - INFO - Start running, host: root@mrli-HP-Z440, work_dir: /mmocr/work_dirs/sdmgr_unet16_60e_subtitile_classify 2021-11-16 04:41:11,105 - mmocr - INFO - workflow: [('train', 1)], max: 60 epochs 2021-11-16 04:41:19,250 - mmocr - INFO - Epoch [1][50/59214] lr: 1.000e-03, eta: 6 days, 16:43:32, time: 0.163, data_time: 0.048, memory: 962, loss_node: nan, loss_edge: nan, acc_node: 11.0414, acc_edge: 4.0000, loss: nan 2021-11-16 04:41:25,053 - mmocr - INFO - Epoch [1][100/59214] lr: 1.000e-03, eta: 5 days, 17:37:30, time: 0.116, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 10.4146, acc_edge: 0.0000, loss: nan 2021-11-16 04:41:30,881 - mmocr - INFO - Epoch [1][150/59214] lr: 1.000e-03, eta: 5 days, 10:05:07, time: 0.117, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 11.1532, acc_edge: 0.0000, loss: nan 2021-11-16 04:41:36,720 - mmocr - INFO - Epoch [1][200/59214] lr: 1.000e-03, eta: 5 days, 6:22:24, time: 0.117, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 10.5853, acc_edge: 0.0000, loss: nan 2021-11-16 04:41:42,385 - mmocr - INFO - Epoch [1][250/59214] lr: 1.000e-03, eta: 5 days, 3:27:33, time: 0.113, data_time: 0.003, memory: 971, loss_node: nan, loss_edge: nan, acc_node: 7.8421, acc_edge: 0.0000, loss: nan 2021-11-16 04:41:48,179 - mmocr - INFO - Epoch [1][300/59214] lr: 1.000e-03, eta: 5 days, 1:56:09, time: 0.116, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 8.9169, acc_edge: 0.0000, loss: nan 2021-11-16 04:41:53,983 - mmocr - INFO - Epoch [1][350/59214] lr: 1.000e-03, eta: 5 days, 0:52:50, time: 0.116, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 7.6211, acc_edge: 0.0000, loss: nan 2021-11-16 04:41:59,853 - mmocr - INFO - 
Epoch [1][400/59214] lr: 1.000e-03, eta: 5 days, 0:14:48, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 8.9602, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:05,717 - mmocr - INFO - Epoch [1][450/59214] lr: 1.000e-03, eta: 4 days, 23:44:38, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.1780, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:11,605 - mmocr - INFO - Epoch [1][500/59214] lr: 1.000e-03, eta: 4 days, 23:23:11, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 11.6985, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:17,470 - mmocr - INFO - Epoch [1][550/59214] lr: 1.000e-03, eta: 4 days, 23:03:16, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 9.7816, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:23,296 - mmocr - INFO - Epoch [1][600/59214] lr: 1.000e-03, eta: 4 days, 22:42:42, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 11.2980, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:29,190 - mmocr - INFO - Epoch [1][650/59214] lr: 1.000e-03, eta: 4 days, 22:31:31, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 12.2074, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:34,988 - mmocr - INFO - Epoch [1][700/59214] lr: 1.000e-03, eta: 4 days, 22:13:48, time: 0.116, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.6120, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:40,757 - mmocr - INFO - Epoch [1][750/59214] lr: 1.000e-03, eta: 4 days, 21:56:10, time: 0.115, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 7.0793, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:46,679 - mmocr - INFO - Epoch [1][800/59214] lr: 1.000e-03, eta: 4 days, 21:52:01, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.7206, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:52,534 - 
mmocr - INFO - Epoch [1][850/59214] lr: 1.000e-03, eta: 4 days, 21:43:40, time: 0.117, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 13.3489, acc_edge: 0.0000, loss: nan 2021-11-16 04:42:58,445 - mmocr - INFO - Epoch [1][900/59214] lr: 1.000e-03, eta: 4 days, 21:39:52, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 10.2977, acc_edge: 0.0000, loss: nan 2021-11-16 04:43:04,358 - mmocr - INFO - Epoch [1][950/59214] lr: 1.000e-03, eta: 4 days, 21:36:40, time: 0.118, data_time: 0.003, memory: 1132, loss_node: nan, loss_edge: nan, acc_node: 8.5367, acc_edge: 0.0000, loss: nan 2021-11-16 04:43:10,178 - mmocr - INFO - Exp name: sdmgr_unet16_60e_subtitile_classify.py Traceback (most recent call last): File "/mmocr/tools/train.py", line 212, in
main()
File "/mmocr/tools/train.py", line 208, in main
meta=meta)
File "/mmocr/mmocr/apis/train.py", line 161, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
self.call_hook('after_train_iter')
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 42, in after_train_iter
runner.optimizer.step()
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/optim/adam.py", line 96, in step
grad = grad.add(p, alpha=group['weight_decay'])
RuntimeError: CUDA error: an illegal memory access was encountered
Process finished with exit code 1
A sample from the training data: {"file_name": "0001.江苏网络电视台-晚间新闻 20211025_500.jpg", "height": 720, "width": 1280, "annotations": [{"box": [116.0, 601.0, 116.0, 644.0, 192.0, 644.0, 192.0, 601.0], "label": 6, "text": "新闻"}, {"box": [117.0, 566.0, 117.0, 602.0, 190.0, 602.0, 190.0, 566.0], "label": 6, "text": "晚间"}, {"box": [911.0, 456.0, 911.0, 487.0, 1004.0, 487.0, 1004.0, 456.0], "label": 1, "text": "王柏文"}, {"box": [1049.0, 456.0, 1051.0, 489.0, 1123.0, 486.0, 1122.0, 454.0], "label": 1, "text": "主播"}, {"box": [1035.0, 72.0, 1035.0, 95.0, 1197.0, 97.0, 1197.0, 74.0], "label": 6, "text": "JsTVcomp"}, {"box": [197.0, 62.0, 197.0, 92.0, 311.0, 90.0, 311.0, 59.0], "label": 6, "text": "江苏卫视"}, {"box": [1036.0, 31.0, 1036.0, 68.0, 1183.0, 68.0, 1183.0, 31.0], "label": 6, "text": "凉枝叹"}]}
My questions: why is the loss NaN, and why does the CUDA error occur? Thanks.
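Before retraining, it may help to sanity-check the annotation file against the configured sizes. The sketch below assumes the line-JSON layout shown in the sample (an annotations list whose items carry label and text); check_annotations is a hypothetical helper, not an MMOCR function. It only reports labels outside [0, num_classes) and characters missing from the dictionary, and makes no claim about how MMOCR itself handles either case.

```python
import json


def check_annotations(lines, dict_chars, num_classes):
    """Collect out-of-range labels and characters absent from the dict."""
    bad_labels, unknown_chars = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for ann in record["annotations"]:
            if not 0 <= ann["label"] < num_classes:
                bad_labels.add(ann["label"])
            unknown_chars.update(
                ch for ch in ann["text"] if ch not in dict_chars)
    return bad_labels, unknown_chars


# Illustrative record in the same shape as the training sample.
sample = ('{"file_name": "demo.jpg", "height": 720, "width": 1280, '
          '"annotations": [{"box": [0, 0, 0, 1, 1, 1, 1, 0], '
          '"label": 6, "text": "新闻"}]}')
bad_labels, unknown_chars = check_annotations(
    [sample], dict_chars={"新"}, num_classes=6)
print(bad_labels, unknown_chars)  # {6} {'闻'}
```

On real data, lines would come from train.txt and dict_chars from the characters in dict.txt; two empty sets would mean the labels and text are consistent with num_classes and the dictionary.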