open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

[Bug] DATASET PREPARER for mjsynth dataset is not working #1896

Open MichaelChao02 opened 1 year ago

MichaelChao02 commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmocr

I'm a little confused about the branch here. I followed the dev-1.x installation guide but wasn't required to change the branch.

Environment

sys.platform: linux
Python: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0]
CUDA available: False
numpy_random_seed: 2147483648
GCC: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
PyTorch: 2.0.0+cu117
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.15.1+cu117
OpenCV: 4.7.0
MMEngine: 0.7.3
MMOCR: 1.0.0+unknown

Reproduces the problem - code sample

There is no customized code.

Reproduces the problem - command or script

python tools/dataset_converters/prepare_dataset.py mjsynth --task textrecog --lmdb

Reproduces the problem - error message

Written 260000 / 8919273
Written 261000 / 8919273
Written 262000 / 8919273
Traceback (most recent call last):
  File "/gpfs/projects/ZhuGroup/dev1/tools/dataset_converters/prepare_dataset.py", line 153, in <module>
    main()
  File "/gpfs/projects/ZhuGroup/dev1/tools/dataset_converters/prepare_dataset.py", line 149, in main
    preparer.run(args.splits)
  File "/gpfs/projects/ZhuGroup/dev1/mmocr/datasets/preparers/data_preparer.py", line 85, in run
    self.loop(split, getattr(self, f'{split}_preparer'))
  File "/gpfs/projects/ZhuGroup/dev1/mmocr/datasets/preparers/data_preparer.py", line 179, in loop
    dumper(samples)
  File "/gpfs/projects/ZhuGroup/dev1/mmocr/datasets/preparers/dumpers/base.py", line 32, in __call__
    self.dump(data)
  File "/gpfs/projects/ZhuGroup/dev1/mmocr/datasets/preparers/dumpers/lmdb_dumper.py", line 124, in dump
    if not self.check_image_is_valid(image_bin):
  File "/gpfs/projects/ZhuGroup/dev1/mmocr/datasets/preparers/dumpers/lmdb_dumper.py", line 59, in check_image_is_valid
    img = cv2.imdecode(imageBuf, cv2.IMREAD_GRAYSCALE)
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgcodecs/src/loadsave.cpp:798: error: (-215:Assertion failed) !buf.empty() in function 'imdecode_'

Additional information

I downloaded the MJSynth data via Academic Torrents because the HTTP connection is very slow. The only potential problem I can think of is that the website lists the file size as 10.68 GB, but the file I downloaded is only 9.95 GB. I downloaded it several times and the result was always the same. (If the website counts 1 GB as 1000 MB while my system counts 1 GB as 1024 MB, the difference makes sense.) When I tried to convert the data to LMDB format, the error appeared after writing 262000 / 8919273 samples. I tried this on multiple devices, and the error occurs at exactly the same place. I cannot figure out what causes the problem.
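The size difference described above can be checked with a short calculation. This is only a sketch: it assumes the website reports decimal gigabytes (10^9 bytes) and the local system reports binary gibibytes (2^30 bytes), which the issue does not confirm.

```python
# Sizes as quoted in the issue; the unit interpretation is an assumption.
listed_gb = 10.68                     # size shown on the website, read as decimal GB
size_bytes = listed_gb * 1000**3      # 1 GB = 10^9 bytes
size_gib = size_bytes / 1024**3       # 1 GiB = 2^30 bytes

print(f"{listed_gb} GB is about {size_gib:.2f} GiB")
```

Under that assumption the two figures agree to two decimal places, so the download size alone does not point to a truncated archive.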

If someone can run the code to completion, maybe they could:

  1. provide the hash of mjsynth.tar.gz so I can make sure I'm using the right file.
  2. send me the converted LMDB file and config file

so that I can further inspect the causes.
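For comparing archives between machines, a digest can be computed with Python's standard `hashlib`. This is a minimal sketch: the helper name `file_sha256` and the choice of SHA-256 are assumptions made for illustration, and no reference hash for mjsynth.tar.gz is given in this thread.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read fixed-size chunks so a multi-gigabyte archive never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("mjsynth.tar.gz")  # assumed to be in the current directory
    if archive.exists():
        print(file_sha256(archive))
```

Two people who run this on their copies and get the same digest can be reasonably sure they hold identical files.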

gaotongxiao commented 1 year ago

It happens due to some broken images in MJSynth. A workaround is to set the verify flag to False to skip the verification process. https://github.com/open-mmlab/mmocr/blob/d56155c82df3b0a4e859b692acc7fd9a26d760d3/mmocr/datasets/preparers/dumpers/lmdb_dumper.py#L46

But eventually we need to pre-check if the image is empty at the beginning of this method to prevent the fatal error: https://github.com/open-mmlab/mmocr/blob/d56155c82df3b0a4e859b692acc7fd9a26d760d3/mmocr/datasets/preparers/dumpers/lmdb_dumper.py#L55
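The pre-check suggested above might look roughly like the following. This is a standalone sketch, not the MMOCR implementation: the function name `is_valid_image_bytes` is hypothetical, and it assumes the image arrives as a `bytes` object, as in the traceback.

```python
def is_valid_image_bytes(image_bin):
    """Return True only if image_bin is non-empty and decodes to a non-degenerate image."""
    # Pre-check: cv2.imdecode asserts on an empty buffer, so reject it first.
    if not image_bin:
        return False

    # Imported here so the empty-buffer guard above needs no third-party packages.
    import cv2
    import numpy as np

    buf = np.frombuffer(image_bin, dtype=np.uint8)
    try:
        img = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)
    except cv2.error:
        # Treat any decoder assertion as an invalid image instead of crashing.
        return False
    # imdecode returns None when the bytes are not a decodable image.
    if img is None:
        return False
    height, width = img.shape[:2]
    return height * width > 0
```

With a guard like this, a broken sample can be skipped (and ideally logged) while the rest of the dataset is still verified, instead of disabling verification entirely.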

EomSooHwan commented 1 year ago

> It happens due to some broken images in MJSynth. A workaround is to set the verify flag to False to skip the verification process.
>
> https://github.com/open-mmlab/mmocr/blob/d56155c82df3b0a4e859b692acc7fd9a26d760d3/mmocr/datasets/preparers/dumpers/lmdb_dumper.py#L46
>
> But eventually we need to pre-check if the image is empty at the beginning of this method to prevent the fatal error:
>
> https://github.com/open-mmlab/mmocr/blob/d56155c82df3b0a4e859b692acc7fd9a26d760d3/mmocr/datasets/preparers/dumpers/lmdb_dumper.py#L55

Hello, I am having the same issue here. Is there a specific way to pre-check whether an image is missing or broken, as you said? For example, is it possible to check whether imageBuf is a valid input for cv2.imdecode, so that the verification step does not crash?
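One way to look for such files before conversion: the `!buf.empty()` assertion in the traceback means the buffer passed to `cv2.imdecode` contained no bytes, so zero-byte image files are a likely culprit. The sketch below lists them using only the standard library. The helper name `find_empty_files` and the suffix list are assumptions, and it will not catch files that are non-empty but corrupted.

```python
from pathlib import Path


def find_empty_files(root, suffixes=(".jpg", ".jpeg", ".png")):
    """Return sorted paths under root that have an image suffix and zero size."""
    empty = []
    for path in Path(root).rglob("*"):
        if (path.is_file()
                and path.suffix.lower() in suffixes
                and path.stat().st_size == 0):
            empty.append(path)
    return sorted(empty)
```

For example, `find_empty_files("data/mjsynth")` (a hypothetical extraction directory) would return the zero-byte images, which could then be excluded or re-extracted before running the LMDB conversion.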