open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

[Bug] Slow training-time evaluation #1471

Open hzhuangdy opened 2 years ago

hzhuangdy commented 2 years ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

Environment

fatal: not a git repository (or any parent up to mount point /DATA/disk1)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.0+cu116
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.6
MMOCR: 0.6.1+ pytorch 1.12-cu116

Reproduces the problem - code sample

python train.py

Reproduces the problem - command or script

python train.py

Reproduces the problem - error message

Evaluating /instances_test.json with 1535 images now
2022-10-20 16:29:56,870 - mmocr - INFO - Evaluating hmean-iou...
2022-10-20 16:37:27,989 - mmocr - INFO - thr 0.30, recall: 0.408, precision: 0.564, hmean: 0.473

It takes almost 10 minutes for one evaluation, and I noticed that the GPU utilization does not change during the evaluation (it stays the same as during training).

Additional information

No response

gaotongxiao commented 2 years ago

You can use a larger validation-time batch size in your config by modifying samples_per_gpu in val_dataloader:

data = dict(
    val_dataloader=dict(samples_per_gpu=1))  # increase this value, e.g. 4 or 8

See the docs for more details.

hzhuangdy commented 2 years ago

Thanks for the reply.

I have set the batch size to 4 and it still takes a long time (at least 10 minutes per IoU threshold).

gaotongxiao commented 2 years ago

I see. Then it's not a matter of batch size. Does your dataset have a large number of polygons per image? The implementation of the hmean-iou metric in MMOCR 0.x is known to have a performance issue, since it performs redundant computation for every threshold. MMOCR 1.0 has significantly optimized the implementation with vectorization and full reuse of intermediate results (in hmean_iou_metric.py). You're welcome to upgrade to MMOCR 1.0 to enjoy all the improvements there.
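
To illustrate the idea (this is not the actual hmean_iou_metric.py code, just a minimal sketch using shapely and numpy): the gt-vs-pred IoU matrix can be computed once per image and then reused for every IoU threshold, instead of redoing the polygon math per threshold.

import numpy as np
from shapely.geometry import Polygon


def iou_matrix(gt_polys, pred_polys):
    """Pairwise IoU between two lists of shapely Polygons."""
    ious = np.zeros((len(gt_polys), len(pred_polys)))
    for i, gt in enumerate(gt_polys):
        for j, pred in enumerate(pred_polys):
            inter = gt.intersection(pred).area
            union = gt.union(pred).area
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious


gts = [Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])]
preds = [Polygon([(1, 1), (11, 1), (11, 11), (1, 11)])]
ious = iou_matrix(gts, preds)  # computed once, reused for all thresholds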

hzhuangdy commented 2 years ago

I tried both versions and they perform the same. My test data indeed has almost 30+ polygons per image, but the key point is that the test process does not seem to use the GPU for inference. It occupies 4 GB of GPU memory, but the GPU utilization is almost 0.

gaotongxiao commented 2 years ago

I don't think the GPU went unused at inference time; more likely the CPU was the bottleneck, which happens with CPU-intensive models (e.g. DRRG). You can check your model's implementation and see whether it behaves differently at training and test time.

I have set the batch size to 4 and it still takes a long time (at least 10 minutes per IoU threshold).

But from your description, I still feel the old evaluation metric is to blame. At the very least, the implementation in MMOCR 1.0 won't perform redundant computations for each IoU threshold: it computes the matches only once and the per-threshold results are then available almost instantly, so a significant amount of time can be saved if each IoU threshold took 10 minutes in MMOCR 0.x.
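
For illustration only (this is not the MMOCR implementation): once the pairwise IoU matrix for an image has been computed, sweeping the thresholds reduces to cheap boolean masking, which is why extra thresholds add almost no cost. The matching below is a crude approximation of the real one-to-one matching, just to show the shape of the computation.

import numpy as np


def hmean_at_thresholds(ious, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Derive recall/precision/hmean per threshold from a single IoU matrix."""
    num_gt, num_pred = ious.shape
    results = {}
    for thr in thresholds:
        matched = ious >= thr  # boolean mask, no polygon math repeated
        # crude approximation of one-to-one matching, for illustration only
        tp = min(matched.any(axis=1).sum(), matched.any(axis=0).sum())
        recall = tp / num_gt if num_gt else 0.0
        precision = tp / num_pred if num_pred else 0.0
        hmean = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
        results[thr] = dict(recall=recall, precision=precision, hmean=hmean)
    return results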

hzhuangdy commented 2 years ago

OK, I will try it again later. By the way, I use DBNet++, and I found that on the 1.x branch the EastRandomCrop op has been replaced by a normal RandomCrop op. I wonder whether this replacement affects the behaviour of DBNet++, and why this change was made. Also, I can't find a way to change the unclip ratio in the config files.

gaotongxiao commented 2 years ago

I use DBNet++, and I found that on the 1.x branch the EastRandomCrop op has been replaced by a normal RandomCrop op. I wonder whether this replacement affects the behaviour of DBNet++.

We have answered it here: https://github.com/open-mmlab/mmocr/issues/1443

Also, I can't find a way to change the unclip ratio in Config files.

The unclip ratio has been moved to DBPostprocessor: https://github.com/open-mmlab/mmocr/blob/52a787397363835bf1dd3bfe12999b7aadeb7e49/mmocr/models/textdet/postprocessors/db_postprocessor.py#L34 and can be modified at this line of the base config: https://github.com/open-mmlab/mmocr/blob/53562d8526cebf05c2850e12805c00bd586f7c61/configs/textdet/dbnetpp/_base_dbnetpp_resnet50-dcnv2_fpnc.py#L24
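
For example, a downstream config could override it along these lines. This is only a sketch: check the exact keys and the _base_ path against your own config, but the standard DBNet++ configs nest the postprocessor under model.det_head.

_base_ = ['_base_dbnetpp_resnet50-dcnv2_fpnc.py']  # adjust to your base config

model = dict(
    det_head=dict(
        postprocessor=dict(
            type='DBPostprocessor',
            text_repr_type='quad',
            unclip_ratio=2.0)))  # set the value you need here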

hzhuangdy commented 2 years ago

Thanks

hzhuangdy commented 2 years ago

Is there any way to add a progress bar (or logger) inside the evaluation? I upgraded to MMOCR 1.x and ran dist_test.sh, and it gets stuck after the following lines:

local loads checkpoint from path: work_dirs/dbnetpp_resnet50-dcnv2_fpnc_120e_icdar2015/epoch_20.pth
local loads checkpoint from path: work_dirs/dbnetpp_resnet50-dcnv2_fpnc_120e_icdar2015/epoch_20.pth
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.0.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.1.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.2.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.3.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.0.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.1.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.2.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.3.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.4.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.5.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.0.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.1.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.2.conv2 is upgraded to version 2.
The model and loaded state dict do not match exactly
unexpected key in source state_dict: data_preprocessor.mean, data_preprocessor.std
10/21 14:03:16 - mmengine - INFO - Load checkpoint from work_dirs/dbnetpp_resnet50-dcnv2_fpnc_120e_icdar2015/epoch_20.pth
/mmocr/utils/polygon_utils.py:326: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  result = np.array(pco.Execute(distance))

gaotongxiao commented 2 years ago

Not currently, but the progress is now logged just as in training. It usually looks like this:

10/21 14:19:50 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
local loads checkpoint from path: 8729.pth
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.0.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.1.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.2.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.3.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.0.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.1.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.2.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.3.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.4.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.5.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.0.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.1.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.2.conv2 is upgraded to version 2.
The model and loaded state dict do not match exactly

unexpected key in source state_dict: pixel_mean, pixel_std, neck.asf_conv.conv.weight, neck.asf_conv.conv.bias, neck.asf_attn.channel_wise.0.conv.weight, neck.asf_attn.channel_wise.1.conv.weight, neck.asf_attn.spatial_wise.0.conv.weight, neck.asf_attn.spatial_wise.1.conv.weight, neck.asf_attn.attention_wise.conv.weight

10/21 14:19:51 - mmengine - INFO - Load checkpoint from 8729.pth
10/21 14:19:53 - mmengine - INFO - Epoch(test) [5/500]    memory: 2261  
10/21 14:19:54 - mmengine - INFO - Epoch(test) [10/500]    eta: 0:02:10  time: 0.2654  data_time: 0.0150  memory: 715  
10/21 14:19:55 - mmengine - INFO - Epoch(test) [15/500]    eta: 0:02:10  time: 0.1721  data_time: 0.0014  memory: 715  
10/21 14:19:56 - mmengine - INFO - Epoch(test) [20/500]    eta: 0:01:24  time: 0.1754  data_time: 0.0018  memory: 715  
10/21 14:19:56 - mmengine - INFO - Epoch(test) [25/500]    eta: 0:01:24  time: 0.1757  data_time: 0.0019  memory: 71
...

If you haven't modified the logging interval, then your log basically means that your model did not even finish the first 5 batches of inference. I'm not sure whether it's because of the model or the evaluation metric. MMOCR 1.0 does some pre-calculation at the end of each inference epoch, which might still be a bottleneck when the product of gt_polygons and pred_polygons is huge. To identify the reason, could you add the following code snippet after line 87 of https://github.com/open-mmlab/mmocr/blob/9bd5258513a9afcf8708799529c315d9c4f2bc2d/mmocr/evaluation/metrics/hmean_iou_metric.py#L88

import time
start = time.time()  # record when this call to process() starts

and the following to the end of process() method, that is, after line 124: https://github.com/open-mmlab/mmocr/blob/9bd5258513a9afcf8708799529c315d9c4f2bc2d/mmocr/evaluation/metrics/hmean_iou_metric.py#L124

end = time.time()
print('Elapsed time:', end - start)  # time spent in this call to process()

and report the time to us?

If the metric is really the bottleneck, then it would be helpful to let us know the scale of your data - gt_num and pred_num https://github.com/open-mmlab/mmocr/blob/9bd5258513a9afcf8708799529c315d9c4f2bc2d/mmocr/evaluation/metrics/hmean_iou_metric.py#L111
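
For reference, the two counts could be printed next to the elapsed time with a one-liner like the following (hypothetical: it assumes the local variable names at that line really are gt_num and pred_num, so rename them if your checkout differs):

print(f'gt_num:{gt_num}, pred_num:{pred_num}')  # report the polygon counts per sample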

hzhuangdy commented 2 years ago

For normal cases:

Elapsed time: 0.21742606163024902 gt_num:42, pred_num:42
Elapsed time: 0.20623540878295898 gt_num:28, pred_num:28
Elapsed time: 0.0869746208190918 gt_num:28, pred_num:28
Elapsed time: 0.11585068702697754 gt_num:28, pred_num:28
Elapsed time: 0.09441208839416504 gt_num:22, pred_num:22
Elapsed time: 0.0568392276763916 gt_num:29, pred_num:29
Elapsed time: 0.09694695472717285 gt_num:29, pred_num:29
Elapsed time: 0.11534810066223145 gt_num:36, pred_num:36
Elapsed time: 0.1600325107574463 gt_num:172, pred_num:56
Elapsed time: 1.128920555114746 gt_num:24, pred_num:24
Elapsed time: 0.06359171867370605 gt_num:54, pred_num:53
Elapsed time: 0.30768823623657227 gt_num:36, pred_num:37
Elapsed time: 0.14965462684631348 gt_num:72, pred_num:77
Elapsed time: 0.6637179851531982 gt_num:30, pred_num:30
Elapsed time: 0.1144106388092041 gt_num:7, pred_num:8
Elapsed time: 0.008452177047729492 gt_num:42, pred_num:42
Elapsed time: 0.21057415008544922 gt_num:93, pred_num:93
Elapsed time: 1.2866272926330566 gt_num:92, pred_num:92

But in some cases:

gt_num:380, pred_num:43 Elapsed time: 1.8887600898742676
gt_num:252, pred_num:103 Elapsed time: 2.893418788909912

If gt_num and pred_num reach about 100 polygons, it costs around 2 s per image; when both gt and pred have 300+ polygons, it takes almost 10 s per image.
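
As a rough sanity check of that scaling (a self-contained micro-benchmark, not MMOCR code): intersecting every gt polygon with every prediction grows with gt_num * pred_num, which matches the jump you see between ~100 and 300+ polygons per image.

import time

import numpy as np
from shapely.geometry import Polygon


def random_boxes(n, seed, size=1000):
    """Generate n random axis-aligned boxes as shapely Polygons."""
    rng = np.random.default_rng(seed)
    polys = []
    for _ in range(n):
        x, y = rng.uniform(0, size, 2)
        w, h = rng.uniform(5, 50, 2)
        polys.append(Polygon([(x, y), (x + w, y), (x + w, y + h), (x, y + h)]))
    return polys


for n in (30, 100, 300):
    gts, preds = random_boxes(n, seed=0), random_boxes(n, seed=1)
    start = time.time()
    for g in gts:
        for p in preds:
            g.intersection(p).area  # the exact work done for every pair
    print(f'{n} x {n} polygons: {time.time() - start:.2f}s')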

gaotongxiao commented 2 years ago

Thanks for your insight. I feel like the IoU computation can be optimized with some tricks such as parallelization.

Is your dataset publicly accessible? It will be helpful for us to reproduce the issue and improve efficiency asap.

hzhuangdy commented 2 years ago

I'm sorry, my datasets are not publicly accessible. However, they are all document images, so I think some public datasets like PubLayNet might help.

hzhuangdy commented 2 years ago

I also tried the code from WenmuZhou (https://github.com/WenmuZhou/DBNet.pytorch) and its evaluation runs fast, but I don't know what the difference is between it and MMOCR's.

gaotongxiao commented 2 years ago

Thanks for the pointer! We'll look into this implementation.

gaotongxiao commented 2 years ago

It seems their implementation outputs quadrangles by default: https://github.com/WenmuZhou/DBNet.pytorch/blob/678b2ae55e018c6c16d5ac182558517a154a91ed/config/open_dataset_resnest50_FPN_DBhead_polyLR.yaml#L25

And quadrangle evaluation can be optimized to be much faster than the general multi-point polygon case. We'll try to implement it in MMOCR.
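
This is not the change planned for MMOCR, just one generic trick that often speeds up such pairwise polygon IoU loops regardless of representation: reject pairs whose axis-aligned bounding boxes do not overlap before running the exact (and much more expensive) polygon intersection.

from shapely.geometry import Polygon


def iou_with_prefilter(gt, pred):
    """IoU of two shapely Polygons, skipping the exact math when the boxes are disjoint."""
    gx0, gy0, gx1, gy1 = gt.bounds
    px0, py0, px1, py1 = pred.bounds
    # disjoint bounding boxes imply disjoint polygons, so skip the costly part
    if gx1 < px0 or px1 < gx0 or gy1 < py0 or py1 < gy0:
        return 0.0
    union = gt.union(pred).area
    return gt.intersection(pred).area / union if union > 0 else 0.0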