open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

hmeanIoU always zero during training, aka training seems to be broken #1726

Closed yCobanoglu closed 1 year ago

yCobanoglu commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

1.x branch https://github.com/open-mmlab/mmocr/tree/dev-1.x

Environment

sys.platform: linux
Python: 3.10.8 (main, Nov 1 2022, 14:18:21) [GCC 12.2.0]
CUDA available: False
numpy_random_seed: 2147483648
GCC: gcc (GCC) 12.2.0
PyTorch: 1.13.1+cpu
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=0, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.14.1+cpu
OpenCV: 4.7.0
MMEngine: 0.5.0
MMOCR: 1.0.0rc5+a644c85

Reproduces the problem - code sample

python3 mmocr_tools/train.py configs/textdet/textsnake/textsnake_resnet50-oclip_fpn-unet_1200e_icdar2015.py

I tried multiple models; FCENet and Mask R-CNN all suffer from the same issue. I am overfitting a single training sample from icdar2015 and using the exact same sample for testing. During training on this one sample, the hmean-iou always stays at 0. Also, once trained and tested manually (with the test script), not a single bounding box is predicted. To reproduce the issue, just create data/icdar2015/ with the instances_training.json below and an instances_test.json with the same content, and add "img_468.jpg" from icdar2015, so there is one training sample and the exact same test sample. Train any model and observe that the hmean-iou stays at 0. After the model has finished training, try predicting with the test script on the test set.
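For reference, the directory layout this implies (inferred from the default data root and the img_path/seg_map values in the annotation below) would roughly be:

data/icdar2015/
├── instances_training.json   # the single-sample annotation below
├── instances_test.json       # an identical copy of instances_training.json
└── test/
    ├── img_468.jpg           # the one ICDAR2015 image (referenced by img_path)
    └── gt_img_468.txt        # referenced by seg_map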

Here is the training instances_training.json:

{
  "metainfo": {
    "dataset_type": "TextDetDataset",
    "task_name": "textdet",
    "category": [
      {
        "id": 0,
        "name": "text"
      }
    ]
  },
  "data_list": [
    {
      "instances": [
        {
          "polygon": [
            2,
            153,
            190,
            152,
            192,
            199,
            3,
            209
          ],
          "bbox": [
            2.0,
            152.0,
            192.0,
            209.0
          ],
          "bbox_label": 0,
          "ignore": false
        },
        {
          "polygon": [
            188,
            150,
            238,
            150,
            242,
            197,
            192,
            197
          ],
          "bbox": [
            188.0,
            150.0,
            242.0,
            197.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            1,
            247,
            202,
            232,
            206,
            270,
            2,
            291
          ],
          "bbox": [
            1.0,
            232.0,
            206.0,
            291.0
          ],
          "bbox_label": 0,
          "ignore": false
        },
        {
          "polygon": [
            217,
            237,
            310,
            226,
            308,
            254,
            215,
            265
          ],
          "bbox": [
            215.0,
            226.0,
            310.0,
            265.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            1,
            321,
            266,
            286,
            269,
            328,
            2,
            373
          ],
          "bbox": [
            1.0,
            286.0,
            269.0,
            373.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            268,
            289,
            328,
            277,
            331,
            318,
            271,
            330
          ],
          "bbox": [
            268.0,
            277.0,
            331.0,
            330.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            7,
            400,
            174,
            369,
            175,
            417,
            9,
            454
          ],
          "bbox": [
            7.0,
            369.0,
            175.0,
            454.0
          ],
          "bbox_label": 0,
          "ignore": false
        },
        {
          "polygon": [
            185,
            378,
            228,
            365,
            229,
            407,
            186,
            419
          ],
          "bbox": [
            185.0,
            365.0,
            229.0,
            419.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            3,
            564,
            159,
            519,
            170,
            561,
            4,
            619
          ],
          "bbox": [
            3.0,
            519.0,
            170.0,
            619.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            173,
            526,
            273,
            496,
            279,
            539,
            179,
            569
          ],
          "bbox": [
            173.0,
            496.0,
            279.0,
            569.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            477,
            64,
            576,
            56,
            575,
            78,
            476,
            86
          ],
          "bbox": [
            476.0,
            56.0,
            576.0,
            86.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            475,
            87,
            574,
            79,
            575,
            98,
            476,
            106
          ],
          "bbox": [
            475.0,
            79.0,
            575.0,
            106.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            582,
            203,
            703,
            206,
            704,
            226,
            583,
            223
          ],
          "bbox": [
            582.0,
            203.0,
            704.0,
            226.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            704,
            50,
            729,
            51,
            732,
            224,
            707,
            223
          ],
          "bbox": [
            704.0,
            50.0,
            732.0,
            224.0
          ],
          "bbox_label": 0,
          "ignore": false
        },
        {
          "polygon": [
            979,
            324,
            1025,
            321,
            1025,
            346,
            979,
            349
          ],
          "bbox": [
            979.0,
            321.0,
            1025.0,
            349.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            1070,
            316,
            1101,
            313,
            1104,
            340,
            1073,
            342
          ],
          "bbox": [
            1070.0,
            313.0,
            1104.0,
            342.0
          ],
          "bbox_label": 0,
          "ignore": true
        },
        {
          "polygon": [
            1030,
            325,
            1070,
            326,
            1070,
            356,
            1030,
            355
          ],
          "bbox": [
            1030.0,
            325.0,
            1070.0,
            356.0
          ],
          "bbox_label": 0,
          "ignore": false
        }
      ],
      "img_path": "test/img_468.jpg",
      "height": 720,
      "width": 1280,
      "seg_map": "test/gt_img_468.txt"
    }
  ]
}

Reproduces the problem - command or script

see above

Reproduces the problem - error message

02/15 01:25:56 - mmengine - INFO - Exp name: textsnake_20230215_011935
02/15 01:25:56 - mmengine - INFO - Exp name: textsnake_20230215_011935
02/15 01:25:56 - mmengine - INFO - Exp name: textsnake_20230215_011935
02/15 01:25:56 - mmengine - INFO - Exp name: textsnake_20230215_011935
02/15 01:25:57 - mmengine - INFO - Exp name: textsnake_20230215_011935
02/15 01:25:57 - mmengine - INFO - Saving checkpoint at 1200 epochs
02/15 01:25:58 - mmengine - INFO - Evaluating hmean-iou...
02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.30, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.40, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.50, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.60, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.70, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.80, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - prediction score threshold: 0.90, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 01:25:58 - mmengine - INFO - Epoch(val) [1200][1/1]  icdar/precision: 0.0000  icdar/recall: 0.0000  icdar/hmean: 0.0000

Also, I am not sure how I can log the loss, or how to get rid of the "02/15 01:25:56 - mmengine - INFO - Exp name: textsnake_20230215_011935" part of the logging.

Additional information

Related to https://github.com/open-mmlab/mmocr/issues/1661

gaotongxiao commented 1 year ago

Have you trained models on the full training dataset? I'm sure that the training is not broken, but the choices of hyperparameters matter.

Take FCENet (configs/textdet/fcenet/fcenet_resnet50_fpn_1500e_icdar2015.py) as an example. The default learning rate is 1e-3, which works best when the batch size is 8. But in your case the actual batch size is 1, so empirically you can apply the linear scaling rule and reduce the learning rate to 1e-3 / 8 to stabilize the training.
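In config terms that would be something like the following (a minimal sketch assuming the default SGD settings of that config; adapt it to whatever optimizer your config actually uses):

optim_wrapper = dict(
    type='OptimWrapper',
    # linear scaling rule: the default lr of 1e-3 is tuned for batch size 8,
    # so divide by 8 when the effective batch size is 1
    optimizer=dict(type='SGD', lr=1e-3 / 8, momentum=0.9, weight_decay=5e-4))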

The parameter scheduler will reduce the learning rate over time, which is not that useful for overfitting, so you may comment out this section.

# param_scheduler = [
#     dict(type='PolyLR', power=0.9, eta_min=1e-7, end=1500),
# ]

Besides, the default logging interval of 5 is not applicable to your one-sample case; you can add the following snippet to the config:

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=1),
)

so that you should be able to see how the loss changes over time. FYI, here is a part of my log:

02/15 10:28:57 - mmengine - INFO - Exp name: fcenet_resnet50_fpn_1500e_icdar2015_20230215_102404
02/15 10:28:57 - mmengine - INFO - Epoch(train) [1099][1/1]  lr: 1.2500e-04  eta: 0:01:40  time: 0.2516  data_time: 0.0614  memory: 1470  loss: 3.9647  loss_text: 0.4209  loss_center: 1.7943  loss_reg_x: 0.8896  loss_reg_y: 0.8599
02/15 10:28:57 - mmengine - INFO - Exp name: fcenet_resnet50_fpn_1500e_icdar2015_20230215_102404
02/15 10:28:57 - mmengine - INFO - Epoch(train) [1100][1/1]  lr: 1.2500e-04  eta: 0:01:40  time: 0.2493  data_time: 0.0599  memory: 1469  loss: 3.7024  loss_text: 0.4035  loss_center: 1.7274  loss_reg_x: 0.8633  loss_reg_y: 0.7081
02/15 10:28:57 - mmengine - INFO - Epoch(val) [1100][1/1]    eta: 0:00:00  time: 0.2302  data_time: 0.0241  memory: 922  
02/15 10:28:57 - mmengine - INFO - Evaluating hmean-iou...
02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.30, recall: 0.4000, precision: 0.5000, hmean: 0.4444

02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.40, recall: 0.4000, precision: 0.6667, hmean: 0.5000

02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.50, recall: 0.4000, precision: 1.0000, hmean: 0.5714

02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.60, recall: 0.4000, precision: 1.0000, hmean: 0.5714

02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.70, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.80, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 10:28:57 - mmengine - INFO - prediction score threshold: 0.90, recall: 0.0000, precision: 0.0000, hmean: 0.0000

02/15 10:28:57 - mmengine - INFO - Epoch(val) [1100][1/1]  icdar/precision: 1.0000  icdar/recall: 0.4000  icdar/hmean: 0.5714

jturner116 commented 1 year ago

@gaotongxiao I am experiencing the same problem with oclip specifically, and not with dcnv2 for dbnetpp. Could you try reproducing with dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015?

gaotongxiao commented 1 year ago

@jturner116 We have experienced such instability when training these oclip variants. Their performance can be really sensitive to all the randomness factors, which means you may be able to fix this issue just by retrying a few more times. You may also refer to our logs for a quick reference: https://download.openmmlab.com/mmocr/textdet/dbnet/dbnet_resnet50-oclip_1200e_icdar2015/20221102_115917.log
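If you want each retry to start from a different but recorded random state, one option (a sketch assuming the MMEngine-style randomness field exposed by the default runtime config) is to set the seed explicitly and change it between runs:

# assumption: `randomness` is available in your config's default runtime;
# pick a different seed for each retry so runs are distinct but reproducible
randomness = dict(seed=42)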

jturner116 commented 1 year ago

Thank you @gaotongxiao, I am glad it is not just me :D