hzhuangdy opened this issue 2 years ago
You can use a larger validation-time batch size by modifying samples_per_gpu under val_dataloader in your config:

data = dict(
    # ... other fields (train, val, test, ...) ...
    val_dataloader=dict(samples_per_gpu=1))

See the documentation for more details.
Thanks for the reply.
I have set the batch size to 4 and it still takes a long time (at least 10 minutes per IoU threshold).
I see. It's not a matter of batch size, then. Does your dataset have a large number of polygons per image? The implementation of the hmean-iou metric in MMOCR 0.x is known to have a performance issue, since it conducts redundant computation for every IoU threshold. MMOCR 1.0 has significantly optimized this implementation with vectorization and full reuse of intermediate results (in hmean_iou_metric.py). You're welcome to upgrade to MMOCR 1.0 to enjoy all the improvements there.
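To illustrate the idea, here is a minimal sketch of the general technique (my own simplified example, not MMOCR's actual hmean_iou_metric.py code, and the matching step is deliberately naive): the expensive pairwise polygon IoUs are computed once into a matrix, and each threshold then only needs a cheap comparison.

import numpy as np

def hmean_over_thresholds(iou_matrix: np.ndarray, thresholds=(0.3, 0.5, 0.7)):
    """iou_matrix[i, j] is the IoU between gt polygon i and pred polygon j."""
    results = {}
    for thr in thresholds:
        # Only a boolean comparison and a crude matching count per threshold;
        # the costly polygon intersections were done once, outside this loop.
        matched = iou_matrix >= thr
        tp = min(matched.any(axis=1).sum(), matched.any(axis=0).sum())
        recall = tp / max(iou_matrix.shape[0], 1)
        precision = tp / max(iou_matrix.shape[1], 1)
        results[thr] = 2 * recall * precision / max(recall + precision, 1e-8)
    return results

# Example: hmean_over_thresholds(np.random.rand(30, 30))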
I tried both versions and they perform the same. My test data indeed has almost 30+ polygons per image, but the key point is that the test process does not seem to use the GPU for inference. It occupies 4 GB of GPU memory, but the GPU utilization is almost 0.
I don't think the GPU was left unused at inference time; the CPU is most likely the bottleneck if you are using a CPU-intensive model (e.g. DRRG). You can check your model's implementation and see if it behaves differently at training and test time.
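As a quick sanity check (my own suggestion, not an MMOCR utility; model here stands for the detector you build for testing), you can confirm that the weights actually live on the GPU at test time:

import torch

def report_device(model: torch.nn.Module) -> None:
    # For GPU inference the parameter device set should be {'cuda'}.
    devices = {p.device.type for p in model.parameters()}
    print('parameter devices:', devices)
    if torch.cuda.is_available():
        print('allocated MB:', torch.cuda.memory_allocated() // 2**20)

If the parameters are on the GPU but utilization stays near zero, the time is most likely being spent in CPU-side post-processing or the evaluation metric.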
I have set the batch size to 4 and it still takes a long time (at least 10 minutes per IoU threshold).
But from your description, I still suspect the old evaluation metric is to blame. At least, the implementation in MMOCR 1.0 won't perform redundant computations for each IoU threshold: it computes the IoUs only once and the per-threshold results are then available almost instantly, so a significant amount of time can be saved if each IoU threshold took 10 minutes in MMOCR 0.x.
OK, I will try it again later. By the way, I use DBNet++ and I found that on the 1.x branch the EastRandomCrop op is replaced by a plain RandomCrop op. I wonder whether this replacement affects the behaviour of DBNet++, and why you made this change. Also, I can't find a way to change the unclip ratio in the config files.
I use DBNet++ and I found that on the 1.x branch the EastRandomCrop op is replaced by a plain RandomCrop op. I wonder whether this replacement affects the behaviour of DBNet++.
We have answered it here: https://github.com/open-mmlab/mmocr/issues/1443
Also, I can't find a way to change the unclip ratio in the config files.
The unclip ratio has been moved to DBPostprocessor (https://github.com/open-mmlab/mmocr/blob/52a787397363835bf1dd3bfe12999b7aadeb7e49/mmocr/models/textdet/postprocessors/db_postprocessor.py#L34) and can be modified in this line of the config: https://github.com/open-mmlab/mmocr/blob/53562d8526cebf05c2850e12805c00bd586f7c61/configs/textdet/dbnetpp/_base_dbnetpp_resnet50-dcnv2_fpnc.py#L24
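For example, a minimal sketch of overriding it in your own config, assuming your config sits next to the linked base file and inherits from it (the value 2.0 is purely illustrative):

_base_ = ['_base_dbnetpp_resnet50-dcnv2_fpnc.py']

model = dict(
    det_head=dict(
        # unclip_ratio is an argument of DBPostprocessor in MMOCR 1.x
        postprocessor=dict(unclip_ratio=2.0)))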
Thanks
Is there any method to add a progress bar (or logger) inside the evaluation? I upgraded to MMOCR 1.x and ran dist_test.sh, and it got stuck after the following output.

local loads checkpoint from path: work_dirs/dbnetpp_resnet50-dcnv2_fpnc_120e_icdar2015/epoch_20.pth
local loads checkpoint from path: work_dirs/dbnetpp_resnet50-dcnv2_fpnc_120e_icdar2015/epoch_20.pth
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.0.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.1.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.2.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.3.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.0.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.1.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.2.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.3.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.4.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.5.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.0.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.1.conv2 is upgraded to version 2.
10/21 14:03:16 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.2.conv2 is upgraded to version 2.
The model and loaded state dict do not match exactly
unexpected key in source state_dict: data_preprocessor.mean, data_preprocessor.std
10/21 14:03:16 - mmengine - INFO - Load checkpoint from work_dirs/dbnetpp_resnet50-dcnv2_fpnc_120e_icdar2015/epoch_20.pth
/mmocr/utils/polygon_utils.py:326: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
result = np.array(pco.Execute(distance))
Not at the moment. But the progress is now logged just as in training. It usually looks like:
10/21 14:19:50 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
local loads checkpoint from path: 8729.pth
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.0.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.1.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.2.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer2.3.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.0.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.1.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.2.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.3.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.4.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer3.5.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.0.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.1.conv2 is upgraded to version 2.
10/21 14:19:51 - mmengine - INFO - ModulatedDeformConvPack backbone.layer4.2.conv2 is upgraded to version 2.
The model and loaded state dict do not match exactly
unexpected key in source state_dict: pixel_mean, pixel_std, neck.asf_conv.conv.weight, neck.asf_conv.conv.bias, neck.asf_attn.channel_wise.0.conv.weight, neck.asf_attn.channel_wise.1.conv.weight, neck.asf_attn.spatial_wise.0.conv.weight, neck.asf_attn.spatial_wise.1.conv.weight, neck.asf_attn.attention_wise.conv.weight
10/21 14:19:51 - mmengine - INFO - Load checkpoint from 8729.pth
10/21 14:19:53 - mmengine - INFO - Epoch(test) [5/500] memory: 2261
10/21 14:19:54 - mmengine - INFO - Epoch(test) [10/500] eta: 0:02:10 time: 0.2654 data_time: 0.0150 memory: 715
10/21 14:19:55 - mmengine - INFO - Epoch(test) [15/500] eta: 0:02:10 time: 0.1721 data_time: 0.0014 memory: 715
10/21 14:19:56 - mmengine - INFO - Epoch(test) [20/500] eta: 0:01:24 time: 0.1754 data_time: 0.0018 memory: 715
10/21 14:19:56 - mmengine - INFO - Epoch(test) [25/500] eta: 0:01:24 time: 0.1757 data_time: 0.0019 memory: 71
...
If you haven't modified the logging interval, then your log basically means that your model did not even finish the first 5 batches of inference.
I'm not sure if it's because of the model or the evaluation metric. MMOCR 1.0 does some pre-calculation for the metric during inference, which might still be a bottleneck when the number of gt_polygons * pred_polygons is huge. To identify the reason, could you add the following code snippet after line 87 of https://github.com/open-mmlab/mmocr/blob/9bd5258513a9afcf8708799529c315d9c4f2bc2d/mmocr/evaluation/metrics/hmean_iou_metric.py#L88
import time
start = time.time()
and the following to the end of the process() method, that is, after line 124:
https://github.com/open-mmlab/mmocr/blob/9bd5258513a9afcf8708799529c315d9c4f2bc2d/mmocr/evaluation/metrics/hmean_iou_metric.py#L124
end = time.time()
print('Elapsed time:', end - start)
and report the time to us?
If the metric is really the bottleneck, then it would be helpful to let us know the scale of your data - the gt_num and pred_num values at https://github.com/open-mmlab/mmocr/blob/9bd5258513a9afcf8708799529c315d9c4f2bc2d/mmocr/evaluation/metrics/hmean_iou_metric.py#L111
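Putting the pieces together, the requested instrumentation would sit in process() roughly as below. This is only a simplified sketch of the edit, not MMOCR's source: the method body is elided, and the extra gt_num/pred_num print is an assumption about how those two values get reported.

import time

class HmeanIOUMetric:  # simplified stand-in for the real metric class
    def process(self, data_batch, data_samples):
        start = time.time()  # the line added after line 87
        for data_sample in data_samples:
            # ... original per-image IoU pre-calculation, which defines
            # gt_num and pred_num ...
            gt_num, pred_num = 0, 0  # placeholders so the sketch runs
            print(f'gt_num:{gt_num}, pred_num:{pred_num}')
        end = time.time()  # the lines added after line 124
        print('Elapsed time:', end - start)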
In normal situations:
Elapsed time: 0.21742606163024902
gt_num:42, pred_num:42
Elapsed time: 0.20623540878295898
gt_num:28, pred_num:28
Elapsed time: 0.0869746208190918
gt_num:28, pred_num:28
Elapsed time: 0.11585068702697754
gt_num:28, pred_num:28
Elapsed time: 0.09441208839416504
gt_num:22, pred_num:22
Elapsed time: 0.0568392276763916
gt_num:29, pred_num:29
Elapsed time: 0.09694695472717285
gt_num:29, pred_num:29
Elapsed time: 0.11534810066223145
gt_num:36, pred_num:36
Elapsed time: 0.1600325107574463
gt_num:172, pred_num:56
Elapsed time: 1.128920555114746
gt_num:24, pred_num:24
Elapsed time: 0.06359171867370605
gt_num:54, pred_num:53
Elapsed time: 0.30768823623657227
gt_num:36, pred_num:37
Elapsed time: 0.14965462684631348
gt_num:72, pred_num:77
Elapsed time: 0.6637179851531982
gt_num:30, pred_num:30
Elapsed time: 0.1144106388092041
gt_num:7, pred_num:8
Elapsed time: 0.008452177047729492
gt_num:42, pred_num:42
Elapsed time: 0.21057415008544922
gt_num:93, pred_num:93
Elapsed time: 1.2866272926330566
gt_num:92, pred_num:92
But in some cases:
gt_num:380, pred_num:43
Elapsed time: 1.8887600898742676
gt_num:252, pred_num:103
Elapsed time: 2.893418788909912
If gt_num and pred_num are up to around 100 polygons, it costs about 2 s per image. When the gt has 300+ and the pred has 300+ polygons, it takes almost 10 s per image.
Thanks for your insight. I feel like the IoU computation can be optimized with some tricks such as parallelization.
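As one possible direction, here is a rough sketch of computing the per-image IoU matrices in parallel worker processes. This is my own example rather than MMOCR code, and it uses Shapely purely for illustration:

from concurrent.futures import ProcessPoolExecutor

import numpy as np
from shapely.geometry import Polygon

def iou_matrix(sample):
    # Compute the full gt x pred IoU matrix for one image.
    gt_polys, pred_polys = sample
    gts = [Polygon(p) for p in gt_polys]
    preds = [Polygon(p) for p in pred_polys]
    mat = np.zeros((len(gts), len(preds)))
    for i, g in enumerate(gts):
        for j, p in enumerate(preds):
            union = g.union(p).area
            mat[i, j] = g.intersection(p).area / union if union > 0 else 0.0
    return mat

if __name__ == '__main__':
    # Two toy images, each a (gt_polygons, pred_polygons) pair of point lists.
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    shifted = [(0.5, 0), (1.5, 0), (1.5, 1), (0.5, 1)]
    samples = [([square], [shifted]), ([square, shifted], [square])]
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(iou_matrix, samples)))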
Is your dataset publicly accessible? It will be helpful for us to reproduce the issue and improve efficiency asap.
I'm sorry, my datasets are not publicly accessible. However, they are all document images, so I think some public datasets like PubLayNet might help.
I also tried the code from WenmuZhou (https://github.com/WenmuZhou/DBNet.pytorch); his evaluation runs fast, but I don't know what the difference is between his implementation and MMOCR's.
Thanks for the pointer! We'll look into this implementation.
It seems their implementation outputs quadrangles by default: https://github.com/WenmuZhou/DBNet.pytorch/blob/678b2ae55e018c6c16d5ac182558517a154a91ed/config/open_dataset_resnest50_FPN_DBhead_polyLR.yaml#L25
And quadrangle evaluation can be optimized to be much faster than the multi-point polygon case. We'll try to implement it in MMOCR.
Prerequisite
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmocr
Environment
fatal: not a git repository (or any parent up to mount point /DATA/disk1)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: Tesla V100-PCIE-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.6, V11.6.124
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.0+cu116
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.0+cu116
OpenCV: 4.6.0
MMCV: 1.6.2
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.6
MMOCR: 0.6.1+ pytorch 1.12-cu116
Reproduces the problem - code sample
python train.py
Reproduces the problem - command or script
python train.py
Reproduces the problem - error message
Evaluating /instances_test.json with 1535 images now
2022-10-20 16:29:56,870 - mmocr - INFO - Evaluating hmean-iou...
2022-10-20 16:37:27,989 - mmocr - INFO - thr 0.30, recall: 0.408, precision: 0.564, hmean: 0.473
It takes almost 10 minutes for one evaluation, and I notice that the GPU utilization does not change during the evaluation (it stays the same as during training).
Additional information
No response