open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

ZeroDivisionError: float division by zero when evaluating or testing ICDAR2017-MLT on fcenet_r50dcnv2_fpn_1500e_ctw1500.py #447

Closed liuqc11 closed 3 years ago

liuqc11 commented 3 years ago

I tried to train FCENet on ICDAR2017-MLT and revised the config file fcenet_r50dcnv2_fpn_1500e_ctw1500.py for training. The config file is as follows. Training went well for the first 20 epochs.

fourier_degree = 5
model = dict(
    type='FCENet',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(1, 2, 3),
        frozen_stages=-1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        dcn=dict(type='DCNv2', deform_groups=2, fallback_on_stride=False),
#         init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50'),
        stage_with_dcn=(False, True, True, True)),
    neck=dict(
        type='FPN',
        in_channels=[512, 1024, 2048],
        out_channels=256,
        add_extra_convs='on_output',
        num_outs=3,
        relu_before_extra_convs=True,
        act_cfg=None),
    bbox_head=dict(
        type='FCEHead',
        in_channels=256,
        scales=(8, 16, 32),
        loss=dict(type='FCELoss'),
        text_repr_type='quad',
        fourier_degree=fourier_degree,
    ))

train_cfg = None
test_cfg = None

dataset_type = 'IcdarDataset'
data_root = '/root/ftp/icdar2017-mlt'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(
        type='LoadTextAnnotations',
        with_bbox=True,
        with_mask=True,
        poly2mask=False),
    dict(
        type='ColorJitter',
        brightness=32.0 / 255,
        saturation=0.5,
        contrast=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='RandomScaling', size=800, scale=(3. / 4, 5. / 2)),
    dict(
        type='RandomCropFlip', crop_ratio=0.5, iter_num=1, min_area_ratio=0.2),
    dict(
        type='RandomCropPolyInstances',
        instance_key='gt_masks',
        crop_ratio=0.8,
        min_side_ratio=0.3),
    dict(
        type='RandomRotatePolyInstances',
        rotate_ratio=0.5,
        max_angle=30,
        pad_with_fixed_color=False),
    dict(type='SquareResizePad', target_size=800, pad_ratio=0.6),
    dict(type='RandomFlip', flip_ratio=0.5, direction='horizontal'),
    dict(type='Pad', size_divisor=32),
    dict(
        type='FCENetTargets',
        fourier_degree=fourier_degree,
        level_proportion_range=((0, 0.25), (0.2, 0.65), (0.55, 1.0))),
    dict(
        type='CustomFormatBundle',
        keys=['p3_maps', 'p4_maps', 'p5_maps'],
        visualize=dict(flag=False, boundary_key=None)),
    dict(type='Collect', keys=['img', 'p3_maps', 'p4_maps', 'p5_maps'])
]
test_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2260, 2260),
        flip=False,
        transforms=[
            dict(type='Resize', img_scale=(1280, 800), keep_ratio=True),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=20,
    workers_per_gpu=5,
    val_dataloader=dict(samples_per_gpu=1),
    test_dataloader=dict(samples_per_gpu=1),
    train=dict(
        type=dataset_type,
        ann_file=data_root + '/instances_training.json',
        img_prefix=data_root + '/imgs',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + '/instances_val.json',
        img_prefix=data_root + '/imgs',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + '/instances_val.json',
        img_prefix=data_root + '/imgs',
        pipeline=test_pipeline))
evaluation = dict(interval=5, metric='hmean-iou')

# optimizer
optimizer = dict(type='SGD', lr=1e-3, momentum=0.90, weight_decay=5e-4)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='poly', power=0.9, min_lr=1e-7, by_epoch=True)
total_epochs = 500

checkpoint_config = dict(interval=5)
# yapf:disable
log_config = dict(
    interval=20,
    hooks=[
        dict(type='TextLoggerHook')

    ])
# yapf:enable
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = 'work_dir/fce/latest.pth'
workflow = [('train', 1)]
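
As a side note, the fully merged config can be inspected before training with mmcv's Config utility. A small sketch (the config path below is the stock file name and may differ from your local copy):

from mmcv import Config

# Load the revised config and print the merged settings to double-check the
# pipelines and dataset paths actually in effect (the path is illustrative).
cfg = Config.fromfile('configs/textdet/fcenet/fcenet_r50dcnv2_fpn_1500e_ctw1500.py')
print(cfg.pretty_text)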

Describe the bug and Error traceback: When evaluating on the ICDAR2017-MLT validation set, a ZeroDivisionError: float division by zero is sometimes raised. The error log is as follows.

[>>>>>>>>>>>>>>>>>                                 ] 633/1800, 5.2 task/s, elapsed: 122s, ETA:   226s
Traceback (most recent call last):
  File "./tools/train.py", line 221, in <module>
    main()
  File "./tools/train.py", line 217, in main
    meta=meta)
  File "/root/mmocr/mmocr/apis/train.py", line 162, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/root/.local/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/root/.local/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/root/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/root/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 237, in after_train_epoch
    self._do_evaluate(runner)
  File "/root/.local/lib/python3.7/site-packages/mmdet/core/evaluation/eval_hooks.py", line 17, in _do_evaluate
    results = single_gpu_test(runner.model, self.dataloader, show=False)
  File "/root/.local/lib/python3.7/site-packages/mmdet/apis/test.py", line 27, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/root/.local/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 42, in forward
    return super().forward(*inputs, **kwargs)
  File "/root/.local/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/.local/conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 173, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/root/.local/lib/python3.7/site-packages/mmdet/models/detectors/base.py", line 146, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/root/mmocr/mmocr/models/textdet/detectors/fcenet.py", line 34, in simple_test
    boundaries = self.bbox_head.get_boundary(outs, img_metas, rescale)
  File "/root/mmocr/mmocr/models/textdet/dense_heads/fce_head.py", line 112, in get_boundary
    score_map, scale)
  File "/root/mmocr/mmocr/models/textdet/dense_heads/fce_head.py", line 138, in _get_boundary_single
    nms_thr=self.nms_thr)
  File "/root/mmocr/mmocr/models/textdet/postprocess/wrapper.py", line 35, in decode
    return fcenet_decode(**kwargs)
  File "/root/mmocr/mmocr/models/textdet/postprocess/wrapper.py", line 467, in fcenet_decode
    polygons = poly_nms(np.hstack((polygons, score)).tolist(), nms_thr)
  File "/root/mmocr/mmocr/models/textdet/postprocess/wrapper.py", line 502, in poly_nms
    iou_list[i] = boundary_iou(A, B)
  File "/root/mmocr/mmocr/core/evaluation/utils.py", line 192, in boundary_iou
    return poly_iou(src_poly, target_poly)
  File "/root/mmocr/mmocr/core/evaluation/utils.py", line 209, in poly_iou
    return area_inters / poly_union(poly_det, poly_gt)
ZeroDivisionError: float division by zero

Reproduction: When the training process crashed, I tried to test with the latest epoch checkpoint, and I can reproduce the bug with the following command.

./tools/dist_test.sh configs/textdet/fcenet/fcenet_r50dcnv2_fpn_1500e_ctw1500.py work_dir/fce/latest.pth 1 --eval hmean-iou

Environment

My environment is as follows.

sys.platform: linux
Python: 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
CUDA available: True
GPU 0: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0
OpenCV: 4.5.1
MMCV: 1.3.11
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1
MMOCR: 0.2.1+b8f7ead

Bug fix: I have read the code in mmocr/mmocr/core/evaluation/utils.py, and I suspect the bug occurs because utils.py does not handle the case where poly_union(poly_det, poly_gt) is zero. I think this case is triggered by the test_pipeline in the config file:

test_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2260, 2260),
        flip=False,
        transforms=[
            dict(type='Resize', img_scale=(1280, 800), keep_ratio=True),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

The 'MultiScaleFlipAug' transform (maybe inherited from MMDetection; see the source code of mmdet.datasets.pipelines.test_time_aug) does not account for the situation that sometimes leads to this ZeroDivisionError. If I set img_scale in the 'MultiScaleFlipAug' config to a larger value, it sometimes works. I guess this is because the image sizes in the ICDAR2017-MLT eval set vary from 300x104 to 4128x3096, so if img_scale is too small, 'MultiScaleFlipAug' may produce an image with no detectable text for evaluation or testing. Either way, I suggest revising mmocr/mmocr/core/evaluation/utils.py to handle the case where poly_union(poly_det, poly_gt) is zero instead of raising.
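
For example, a guard along these lines would avoid the crash (just a sketch of the idea with a hypothetical function name, not a tested patch against utils.py):

# Hypothetical guarded IoU: return 0 when the union is zero (or negative,
# which can happen with invalid self-intersecting polygons) instead of
# letting the division raise ZeroDivisionError.
def poly_iou_guarded(area_inters, area_det, area_gt):
    union = area_det + area_gt - area_inters
    if union <= 0:
        return 0.0
    return area_inters / union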

gaotongxiao commented 3 years ago

Thanks for reporting that! We're looking into this problem and will fix it asap.

liuqc11 commented 3 years ago

Bug fix

I revised mmocr/mmocr/models/textdet/postprocess/wrapper.py around line 502 as follows:

iou_list = np.zeros((len(index), ))
for i in range(len(index)):
    B = polygons[index[i]][:-1]
    try:
        iou_list[i] = boundary_iou(A, B)
    except ZeroDivisionError:
        # Treat the degenerate candidate as a duplicate so it is suppressed.
        iou_list[i] = threshold + 0.01
        continue
remove_index = np.where(iou_list > threshold)
index = np.delete(index, remove_index)

This temporarily solves the problem.

gaotongxiao commented 3 years ago

The major cause of the problem is the unstable model - sometimes it generates invalid polygons whose area could be 0. Scaling the images somehow randomly alleviates the problem. I'll add a zero division handler to boundary_iou which seems to be a better solution. Thank you again for the insights!
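
Roughly along these lines (an illustrative sketch only, using shapely for the polygon operations, whereas mmocr's utils.py uses a different polygon backend; not the actual patch):

import numpy as np
from shapely.geometry import Polygon

def boundary_iou_safe(src, target):
    # src/target are flat coordinate lists [x1, y1, x2, y2, ...].
    # buffer(0) repairs self-intersecting contours, and a non-positive
    # union short-circuits to IoU = 0 instead of dividing by zero.
    src_poly = Polygon(np.asarray(src, dtype=float).reshape(-1, 2)).buffer(0)
    target_poly = Polygon(np.asarray(target, dtype=float).reshape(-1, 2)).buffer(0)
    inter = src_poly.intersection(target_poly).area
    union = src_poly.area + target_poly.area - inter
    return inter / union if union > 0 else 0.0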

gaotongxiao commented 3 years ago

I'm still curious what the model's output is, though, as there can be many reasons for polygon areas to end up being zero. Would you mind printing out A and B in your ZeroDivisionError handler and sharing the output with us? Thanks!

liuqc11 commented 3 years ago

I'm still curious what the model's output is, though, as there can be many reasons for polygon areas to end up being zero. Would you mind printing out A and B in your ZeroDivisionError handler and sharing the output with us? Thanks!

No problem. I just tested on the ICDAR2017-MLT eval set with epoch_55.pth, and I can simply reproduce it with

type='MultiScaleFlipAug',
img_scale=(2260, 2260),

The output log when the ZeroDivisionError occurs is as follows.

loading annotations into memory...
Done (t=0.25s)
creating index...
index created!
Use load_from_local loader
[>>>>>>>>>>>>>>>>>                                 ] 633/1800, 1.9 task/s, elapsed: 336s, ETA:   620s
A is  [136. 834. 136. 834. 136. 835. 137. 835. 138. 835. 138. 835. 139. 834.
 140. 834. 141. 833. 142. 832. 142. 832. 142. 832. 142. 833. 142. 834.
 141. 835. 141. 835. 140. 835. 140. 835. 139. 835. 139. 834. 138. 834.
 138. 833. 137. 833. 137. 834. 136. 834. 136. 834. 136. 834. 136. 833.
 135. 833. 134. 833. 133. 833. 132. 833. 131. 834. 130. 834. 129. 835.
 129. 835. 129. 835. 129. 834. 130. 833. 130. 833. 131. 832. 131. 832.
 132. 832. 132. 833. 133. 834. 133. 834. 134. 835. 134. 835. 135. 834.
 135. 834.]
B is  [133. 834. 133. 834. 133. 835. 133. 836. 134. 836. 135. 836. 136. 835.
 137. 834. 138. 834. 138. 833. 139. 833. 138. 833. 138. 834. 137. 835.
 137. 836. 137. 836. 136. 836. 136. 836. 136. 836. 136. 835. 135. 835.
 135. 835. 134. 835. 134. 835. 133. 835. 133. 835. 133. 835. 133. 835.
 133. 834. 132. 834. 131. 834. 130. 835. 129. 836. 128. 837. 128. 837.
 128. 837. 128. 837. 129. 836. 130. 834. 130. 833. 130. 833. 131. 833.
 131. 833. 131. 833. 131. 834. 131. 834. 132. 834. 132. 834. 133. 834.
 133. 833.]

liuqc11 commented 3 years ago

I calculated the areas of A and B using Polygon: A.area() = 0.5, B.area() = 1.5, intersection_A_B = 2.0, and union_A_B = A.area() + B.area() - intersection_A_B = 0.
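
For reference, such contours can also be sanity-checked with a plain shoelace computation, independent of the Polygon package. A standalone sketch (note the shoelace value is the net signed area, so it may differ from area() for self-intersecting contours, but a value near zero still flags a degenerate prediction):

import numpy as np

def shoelace_area(flat_pts):
    # Net signed area of a closed contour given as [x1, y1, x2, y2, ...].
    pts = np.asarray(flat_pts, dtype=float).reshape(-1, 2)
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

# e.g. shoelace_area(A) and shoelace_area(B) for the contours printed above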

gaotongxiao commented 3 years ago

Thanks so much for the feedback! Both polygons have self-intersections and thus are invalid, which should have been handled. I'll fix this in the PR soon.