open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

train loss large jumps in some datasets (dbnetpp) #1377

Open hugotong6425 opened 2 years ago

hugotong6425 commented 2 years ago

During training, the train loss rises dramatically, as shown in the following screenshot.

[screenshot: training loss curve with sudden large jumps]

I used the MMOCR-provided scripts to prepare the datasets (https://mmocr.readthedocs.io/en/latest/datasets/det.html).

Some datasets do not have this problem: [screenshot: training loss curve without jumps]

The config I used:

_base_ = [
    '../../_base_/default_runtime.py',
    '../../_base_/schedules/schedule_adam_600e.py',
    '../../_base_/det_models/dbnetpp_r50dcnv2_fpnc.py',
    '../../_base_/det_datasets/combine_dataset.py',
    '../../_base_/det_pipelines/dbnet_pipeline_custom.py'
]

# `{{_base_.xxx}}` pulls the corresponding variables from the base configs listed in `_base_`
train_list = {{_base_.train_list}}
test_list = {{_base_.test_list}}

train_pipeline_r50dcnv2 = {{_base_.train_pipeline_r50dcnv2}}
test_pipeline_4068_1024 = {{_base_.test_pipeline_4068_1024}}

load_from = '/home/mmocr/pretrained_weights/dbnetpp_r50dcnv2_fpnc_100k_iter_synthtext-20220502-db297554.pth'

model = dict(
    bbox_head=dict(
        postprocessor=dict(
            # type='DBPostprocessor', text_repr_type='quad',
            type='DBPostprocessor', text_repr_type='poly',
            epsilon_ratio=0.002
        )
    )
)

data = dict(
    samples_per_gpu=12,
    workers_per_gpu=8,
    val_dataloader=dict(samples_per_gpu=1),
    test_dataloader=dict(samples_per_gpu=1),
    train=dict(
        type='UniformConcatDataset',
        datasets=train_list,
        pipeline=train_pipeline_r50dcnv2),
    val=dict(
        type='UniformConcatDataset',
        datasets=test_list,
        pipeline=test_pipeline_4068_1024),
    test=dict(
        type='UniformConcatDataset',
        datasets=test_list,
        pipeline=test_pipeline_4068_1024))

evaluation = dict(
    interval=1,
    metric='hmean-iou',
    save_best='0_hmean-iou:hmean',
    rule='greater')

Things I tried (with no success; see the sketch after this list for where these tweaks go):

  1. add clip_invalid_ploys=False
  2. disable the rotate and scale augmentations
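
For reference, a rough sketch of where these two tweaks sit in the train pipeline. The ImgAug entry below is copied from the default dbnet_pipeline, so it is an assumption and may not match my dbnet_pipeline_custom.py exactly:

# excerpt of train_pipeline_r50dcnv2 (default dbnet_pipeline shown; my custom pipeline may differ)
dict(
    type='ImgAug',
    args=[
        ['Fliplr', 0.5],
        # tweak 2: removing the two entries below disables the rotate and scale augmentations
        dict(cls='Affine', rotate=[-10, 10]),
        ['Resize', [0.5, 3.0]],
    ],
    # tweak 1: argument name as spelled in the mmocr 0.x ImgAug transform
    clip_invalid_ploys=False),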

I found the same problem occurring in several datasets, including LSVT, TextOCR and HierText.

It seems that if I use schedule_adadelta_18e.py the loss does not make these large jumps, but I am not sure whether grad_clip=dict(max_norm=0.5) will affect model performance.

# schedule_adadelta_18e.py
# optimizer
optimizer = dict(type='Adadelta', lr=0.5)
optimizer_config = dict(grad_clip=dict(max_norm=0.5))
# learning policy
lr_config = dict(policy='step', step=[8, 14, 16])
# running settings
runner = dict(type='EpochBasedRunner', max_epochs=18)
checkpoint_config = dict(interval=1)
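
To separate the effect of the optimizer switch from the effect of clipping, one experiment I have in mind is to keep Adam and only add gradient clipping. A rough sketch of the override (the Adam lr is an assumption taken from schedule_adam_600e.py and may need adjusting):

# sketch: keep the Adam schedule but enable gradient clipping
# (the lr below is assumed from schedule_adam_600e.py)
optimizer = dict(type='Adam', lr=1e-3)
optimizer_config = dict(grad_clip=dict(max_norm=0.5))
# lr_config / runner are still inherited from the base schedule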

Any ideas on the root cause of the loss problem and how to solve it? Thanks for the help!

xinke-wang commented 2 years ago

Hi, thank you for reporting the issues.

We'll look into this as soon as possible. However, due to other high-priority development schedules, and because we do not currently have models trained on these datasets, it may take some time for us to resolve the issue.

I will let you know once we have any updates. Also, if you find any bugs later, you are welcome to discuss them here or raise a PR to fix them. Thank you for your understanding.

hugotong6425 commented 1 year ago

Update: I tried training DBNet++ on the 1.x branch and observed that the loss always jumps a little and then drops rapidly to a certain value at the start of the 2nd epoch (the location is circled in dark blue). Most of the time the loss then stops dropping (like the green, light green and light blue lines). If I am very lucky, the loss keeps dropping like the pink and blue lines, but the sudden loss jump still occurs.

[screenshot: train loss curves for several runs; the 2nd-epoch jump is circled in dark blue]

The grad_norm graph is also provided here.

[screenshot: grad_norm curves for the same runs]

Please let me know if anyone has an idea on how to explain or solve this problem.

hugotong6425 commented 1 year ago

Update: Using SGD makes the training converge. I ran 300 epochs on IC15, IC17, MTWI, ReCTS and SROIE, and the eval result of the final epoch is listed below in case anyone is interested. The 2nd-epoch problem still exists, but fortunately it does not affect training in later epochs.

2023/03/21 00:39:24 - mmengine - INFO - Epoch(val) [300][521/521] IC15/icdar/precision: 0.8326 IC15/icdar/recall: 0.7780 IC15/icdar/hmean: 0.8044 IC17/icdar/precision: 0.8679 IC17/icdar/recall: 0.7063 IC17/icdar/hmean: 0.7788 MTWI/icdar/precision: 0.8009 MTWI/icdar/recall: 0.7122 MTWI/icdar/hmean: 0.7539 ReCTS/icdar/precision: 0.7508 ReCTS/icdar/recall: 0.7216 ReCTS/icdar/hmean: 0.7359 SROIE/icdar/precision: 0.8439 SROIE/icdar/recall: 0.8878 SROIE/icdar/hmean: 0.8653
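
For anyone who wants to try the same switch, a minimal sketch of the optimizer override in an mmocr 1.x (mmengine-style) config. The SGD hyperparameters here are assumptions borrowed from the default DBNet schedules, not necessarily the exact values I used:

# sketch: replace the optimizer with SGD in an mmocr 1.x / mmengine config
# (lr / momentum / weight_decay are assumed from the default DBNet schedule)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.007, momentum=0.9, weight_decay=0.0001))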

@Harold-lkk please feel free to close the issue. Thanks.