open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

CRNN toy (own data) loss: nan loss_ctc: nan #1996

Open anbo724 opened 1 year ago

anbo724 commented 1 year ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

Environment

crnn_mini-vgg_5e_toy.py

# training schedule for 1x
_base_ = [
    '../_base_/default_runtime.py',
    '../_base_/datasets/ipa_data.py',
    '../_base_/schedules/schedule_adadelta_5e.py',
    '_base_crnn_mini-vgg.py',
]

# dataset settings
train_list = [_base_.toy_rec_train]
test_list = [_base_.toy_rec_test]

default_hooks = dict(logger=dict(type='LoggerHook', interval=50), )

train_dataloader = dict(
    batch_size=256,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='ConcatDataset',
        datasets=train_list,
        pipeline=_base_.train_pipeline))
val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='ConcatDataset',
        datasets=test_list,
        pipeline=_base_.test_pipeline))
test_dataloader = val_dataloader

_base_.model.decoder.dictionary.update(
    dict(with_unknown=True, unknown_token=None))
_base_.train_cfg.update(dict(max_epochs=200, val_interval=10))

val_evaluator = dict(dataset_prefixes=['ipa'])
test_evaluator = val_evaluator

ipa_data.py

toy_data_root = '/home/lcj/mmocr/data/recog/ipa10w/'

toy_rec_train = dict(
    type='OCRDataset',
    data_root=toy_data_root,
    data_prefix=dict(img_path='images/'),
    ann_file='train_labels.json',
    pipeline=None,
    test_mode=False)

toy_rec_test = dict(
    type='OCRDataset',
    data_root=toy_data_root,
    data_prefix=dict(img_path='images/'),
    ann_file='test_labels.json',
    pipeline=None,
    test_mode=True)

Reproduces the problem - code sample

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/textrecog/crnn/crnn_mini-vgg_5e_toy.py --work-dir myipa/

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/textrecog/crnn/crnn_mini-vgg_5e_toy.py --work-dir myipa/

Reproduces the problem - error message

10/07 23:13:52 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
10/07 23:13:52 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.
10/07 23:13:52 - mmengine - INFO - Checkpoints will be saved to /home/lcj/mmocr/myipa.
10/07 23:14:03 - mmengine - INFO - Epoch(train) [1][ 50/391] lr: 1.0000e+00 eta: 4:41:58 time: 0.1844 data_time: 0.1028 memory: 1426 loss: 3.2401 loss_ctc: 3.2401
10/07 23:14:11 - mmengine - INFO - Epoch(train) [1][100/391] lr: 1.0000e+00 eta: 3:57:24 time: 0.1422 data_time: 0.0513 memory: 1426 loss: 3.1053 loss_ctc: 3.1053
10/07 23:14:18 - mmengine - INFO - Epoch(train) [1][150/391] lr: 1.0000e+00 eta: 3:44:46 time: 0.1379 data_time: 0.0508 memory: 1426 loss: 2.8808 loss_ctc: 2.8808
10/07 23:14:26 - mmengine - INFO - Epoch(train) [1][200/391] lr: 1.0000e+00 eta: 3:37:39 time: 0.1295 data_time: 0.0486 memory: 1426 loss: 2.9587 loss_ctc: 2.9587
10/07 23:14:34 - mmengine - INFO - Epoch(train) [1][250/391] lr: 1.0000e+00 eta: 3:34:55 time: 0.1951 data_time: 0.1082 memory: 1426 loss: 2.7018 loss_ctc: 2.7018
10/07 23:14:41 - mmengine - INFO - Epoch(train) [1][300/391] lr: 1.0000e+00 eta: 3:32:02 time: 0.1376 data_time: 0.0502 memory: 1426 loss: 2.4804 loss_ctc: 2.4804
10/07 23:14:49 - mmengine - INFO - Epoch(train) [1][350/391] lr: 1.0000e+00 eta: 3:30:05 time: 0.1351 data_time: 0.0509 memory: 1426 loss: nan loss_ctc: nan
10/07 23:14:55 - mmengine - INFO - Exp name: crnn_mini-vgg_5e_toy_20231007_231344
10/07 23:14:55 - mmengine - INFO - Saving checkpoint at 1 epochs
10/07 23:15:05 - mmengine - INFO - Epoch(train) [2][ 50/391] lr: 1.0000e+00 eta: 3:31:22 time: 0.1907 data_time: 0.0931 memory: 1426 loss: nan loss_ctc: nan
10/07 23:15:13 - mmengine - INFO - Epoch(train) [2][100/391] lr: 1.0000e+00 eta: 3:29:32 time: 0.1414 data_time: 0.0608 memory: 1426 loss: nan loss_ctc: nan
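In the log above, the loss first turns to nan somewhere between iterations 300 and 350 of epoch 1. For longer logs, a small script can find that point automatically. The following is a minimal sketch using only the Python standard library. The function name `first_non_finite_loss` is hypothetical, and the regular expression assumes the log line layout shown above.

```python
import math
import re

# Assumed layout (taken from the log above):
# "... Epoch(train) [1][350/391] ... loss: nan loss_ctc: nan"
LINE_RE = re.compile(
    r"Epoch\(train\)\s+\[(\d+)\]\[\s*(\d+)/(\d+)\].*?\bloss:\s+(\S+)"
)


def first_non_finite_loss(log_text):
    """Return (epoch, iteration) of the first nan/inf loss, or None."""
    for match in LINE_RE.finditer(log_text):
        epoch, iteration, _total, loss = match.groups()
        try:
            value = float(loss)  # float() also parses "nan" and "inf"
        except ValueError:
            continue  # skip anything that is not a number
        if not math.isfinite(value):
            return int(epoch), int(iteration)
    return None


# Three lines copied from the log in this issue.
sample = "\n".join([
    "10/07 23:14:41 - mmengine - INFO - Epoch(train) [1][300/391] "
    "lr: 1.0000e+00 eta: 3:32:02 time: 0.1376 data_time: 0.0502 "
    "memory: 1426 loss: 2.4804 loss_ctc: 2.4804",
    "10/07 23:14:49 - mmengine - INFO - Epoch(train) [1][350/391] "
    "lr: 1.0000e+00 eta: 3:30:05 time: 0.1351 data_time: 0.0509 "
    "memory: 1426 loss: nan loss_ctc: nan",
    "10/07 23:15:05 - mmengine - INFO - Epoch(train) [2][ 50/391] "
    "lr: 1.0000e+00 eta: 3:31:22 time: 0.1907 data_time: 0.0931 "
    "memory: 1426 loss: nan loss_ctc: nan",
])

print(first_non_finite_loss(sample))
```

Because the logger interval is 50, this only narrows the failure down to a window of 50 iterations. Lowering `interval` in `default_hooks` would narrow it further.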

Additional information

No response

Vegemo-bear commented 1 year ago

When I train MASTER, NaN appears right from the start, which is strange.

xReniar commented 8 months ago

@anbo724 have you solved it?

SolveProb commented 7 months ago

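"When I train MASTER, NaN appears right from the start, which is strange."

I ran into the same thing: at the very beginning the loss is inf, and every loss after that is nan.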
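One known way for a CTC loss to come out as inf is a label that is too long to align with the model's output sequence: CTC needs at least one time step per character, plus one blank between each pair of identical neighbouring characters. Whether that is what happened in this thread is not confirmed. The sketch below only shows how such labels could be found in a custom dataset. The names `min_ctc_steps` and `find_unalignable` are hypothetical, and `num_steps` (the number of time steps the recognizer emits for an image) is an assumption that has to be taken from the actual model and input width.

```python
def min_ctc_steps(label):
    """Smallest number of time steps CTC needs to emit `label`.

    One step per character, plus one blank between every pair of
    identical neighbours (for example, "aab" needs a blank between
    the two a's, so it needs 4 steps).
    """
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats


def find_unalignable(labels, num_steps):
    """Return (index, label) pairs that cannot fit into `num_steps` steps."""
    return [
        (index, text)
        for index, text in enumerate(labels)
        if min_ctc_steps(text) > num_steps
    ]


# Hypothetical labels and a hypothetical output length of 5 time steps.
labels = ["ab", "aabb", "abcdef"]
print(find_unalignable(labels, num_steps=5))
```

If such labels turn up, the options are to drop or shorten them, or to feed wider inputs. PyTorch's `torch.nn.CTCLoss` also has a `zero_infinity` argument that zeroes infinite losses and their gradients, which hides the symptom rather than fixing the data.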