tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
60.76k stars, 9.35k forks

Tesseract 4.0 hangs when processing a particular image #2288

Open lewislun opened 5 years ago

lewislun commented 5 years ago

Environment

Current Behavior:

Tesseract hangs when running the following command: tesseract failed-image.jpeg output.txt

Output message:

Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 207

Tesseract neither stops nor prints any further message after that. Other images work fine; I only have trouble processing this particular image. I have found that the image produced after processing by Tesseract (or Leptonica?) looks weird, but I don't know whether that is related.

failed-image.jpeg: https://drive.google.com/open?id=1HsgCbtuNpgf_XxzjkekXU9-uuiWDsV0H
tessinput.tif: https://drive.google.com/open?id=1sE8Nn5rykSWPT6PMF3nFSonPMT9y-H61

Expected Behavior:

Tesseract should either give an error message or finish OCR on the image, even if the image quality is bad.

zdenop commented 5 years ago

  1. Your Tesseract version is outdated.
  2. JPEG is not a suitable format for OCR (JPEG compression artifacts).
  3. Your input is not suitable for Tesseract's binarization (Otsu) algorithm (see the result in tessinput.tif). Did you read the ImproveQuality wiki?
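
For context on point 3: Otsu's method picks one global threshold for the whole image by maximizing the between-class variance of the gray-level histogram. The following is a minimal pure-Python sketch of that idea, assuming 8-bit grayscale values in a flat list. It is illustrative only and is not Tesseract's or Leptonica's actual code; the names `otsu_threshold` and `binarize` are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the 8-bit cutoff t that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    Illustrative sketch only, not the Tesseract implementation.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break     # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1/total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map every pixel to 0 (<= t) or 255 (> t)."""
    return [0 if v <= t else 255 for v in pixels]
```

Because there is only one cutoff per image, a global method like this can do poorly on input with uneven lighting or low contrast between text and background.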

stweil commented 5 years ago

The problem also exists with the latest code. This might be another example of issue #2196.

stweil commented 5 years ago

Tesseract hangs in an endless loop here:

(gdb) i s
#0  tesseract::ColPartitionGrid::FindPartitionPartners (this=0x555557d2ea90) at ../../../../../src/textord/colpartitiongrid.cpp:1190
#1  0x00005555555ffdc0 in tesseract::ColumnFinder::FindBlocks (this=0x555557d2e950, pageseg_mode=tesseract::PSM_AUTO, scaled_color=0x0, scaled_factor=-1, 
    input_block=0x555557d1ed60, photo_mask_pix=0x5555559592d0, thresholds_pix=0x555555958550, grey_pix=0x5555559585a0, pixa_debug=0x7ffff69159f0, blocks=0x7fffffffd130, 
    diacritic_blobs=0x7fffffffd208, to_blocks=0x7fffffffd210) at ../../../../../src/textord/colfind.cpp:432
#2  0x00005555555ca938 in tesseract::Tesseract::AutoPageSeg (this=0x7ffff68f2010, pageseg_mode=tesseract::PSM_AUTO, blocks=0x555555955720, to_blocks=0x7fffffffd210, 
    diacritic_blobs=0x7fffffffd208, osd_tess=0x0, osr=0x7fffffffd5d0) at ../../../../../src/ccmain/pagesegmain.cpp:226
#3  0x00005555555ca4d7 in tesseract::Tesseract::SegmentPage (this=0x7ffff68f2010, input_file=0x55555595ed90, blocks=0x555555955720, osd_tess=0x0, osr=0x7fffffffd5d0)
    at ../../../../../src/ccmain/pagesegmain.cpp:139
#4  0x0000555555584380 in tesseract::TessBaseAPI::FindLines (this=0x5555558ce280 <main::api>) at ../../../../../src/api/baseapi.cpp:2090
#5  0x000055555557f7cd in tesseract::TessBaseAPI::Recognize (this=0x5555558ce280 <main::api>, monitor=0x0) at ../../../../../src/api/baseapi.cpp:835
#6  0x0000555555580fa6 in tesseract::TessBaseAPI::ProcessPage (this=0x5555558ce280 <main::api>, pix=0x5555559583c0, page_index=0, 
    filename=0x7fffffffe744 "issue/2288/failed-image.jpeg", retry_config=0x0, timeout_millisec=0, renderer=0x555555950840) at ../../../../../src/api/baseapi.cpp:1228
#7  0x0000555555580d3a in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x5555558ce280 <main::api>, filename=0x7fffffffe744 "issue/2288/failed-image.jpeg", 
    retry_config=0x0, timeout_millisec=0, renderer=0x555555950840) at ../../../../../src/api/baseapi.cpp:1186
#8  0x00005555555806d1 in tesseract::TessBaseAPI::ProcessPages (this=0x5555558ce280 <main::api>, filename=0x7fffffffe744 "issue/2288/failed-image.jpeg", retry_config=0x0, 
    timeout_millisec=0, renderer=0x555555950840) at ../../../../../src/api/baseapi.cpp:1076
#9  0x000055555557b3ae in main (argc=3, argv=0x7fffffffe498) at ../../../../../src/api/tesseractmain.cpp:745

Issue #2196 has a different stack, so it looks like we have two issues with images causing an endless loop in the layout detection.

zdenop commented 5 years ago

Yes, the endless loop is a problem, and that is the reason I keep the issue open. But points 2 and 3 can help to avoid the problem; and even if there were no endless loop, OCR would not produce the expected results with this input.

amitdo commented 5 years ago

The main issue here is Tesseract's binarization method.

I used GIMP's thresholding (60-255) to produce this image.

[image: i2288-bin-60-255]
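
A fixed-range threshold like the one described above (60-255) can be sketched in a few lines. This assumes the usual meaning of a threshold range, where values inside the range become white and everything else becomes black, and it works on a flat list of 8-bit grayscale values. The function name `threshold_range` is hypothetical, and this is not GIMP's code.

```python
def threshold_range(pixels, low=60, high=255):
    """Return 255 for values in [low, high] and 0 for all other values."""
    return [255 if low <= v <= high else 0 for v in pixels]
```

Unlike an automatically chosen cutoff, the range here was picked by hand for this one image, so the same values may not suit other inputs.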

output with best:


Great Daddy, 2014 ELE

Acrylic on canvas
200 x 300 cm

Error during processing.

zdenop commented 5 years ago

@amitdo: I do not think the main issue is Tesseract's binarization method... It works well in most cases (see e.g. 2264), but not in all. I expect that if we replace it with something else, we will get similar reports with other kinds of images.

Anyway, a patch for automatic selection of the best binarization algorithm is welcome ;-)

And of course the infinite loop in Tesseract should be fixed too.

stweil commented 5 years ago

Automatic selection would be great, but a first step could be to offer some binarization algorithms, so the user has a choice (command line option or config parameter).

chintler commented 4 years ago

I'm facing this issue too. Are there any updates or workarounds that I can try, including what @stweil suggested?
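
One generic workaround for a hang, independent of its root cause, is to run the command under a time limit so that a stuck process is killed instead of blocking forever. A minimal sketch using Python's standard `subprocess` module follows. The helper name `run_with_timeout` and the 60-second limit in the usage comment are assumptions made for illustration; this does not fix the endless loop, it only bounds how long you wait.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) and wait at most timeout_s seconds.

    Returns the process exit code, or None if the process was killed
    because it exceeded the time limit.
    """
    try:
        completed = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return None
    return completed.returncode


# Hypothetical usage with the command from this report:
#   code = run_with_timeout(["tesseract", "failed-image.jpeg", "output"], 60)
#   if code is None:
#       print("tesseract did not finish within 60 s; skipping this image")
```

In a batch job this lets the remaining images be processed even when one input triggers the hang.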

Ra-Na commented 4 years ago

Same here. Ubuntu 18.04, tesseract 4.0.0-beta.1.

Ra-Na commented 4 years ago

With Tesseract updated to version 4.1.1 on Ubuntu 18.04.3, the issue is gone (in my case). You have to install 4.1.1 manually. For Ubuntu 18.04 users, simply run:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt upgrade

(Details here)

amitdo commented 4 years ago

@lewislun, was this issue solved for your case with version 4.1.1 or the current code in the master branch?

jcrogel commented 4 years ago

I am still seeing this on 4.1.1 and png files

zdenop commented 4 years ago

@jcrogel: without an image that can help to find the problem, your comment is useless.

saikalyan9981 commented 3 years ago

I'm trying to use "Tesseract Open Source OCR Engine v4.1.1-rc2-20-g01fb with Leptonica" on the following image, and it gets stuck. @zdenop, can you help with this and suggest a workaround? For now, it works fine with --oem 0 (legacy).

Shreeshrii commented 3 years ago

@saikalyan9981 Works fine with current code from repo. Time taken is different based on the traineddata file being used.

(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata_best
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m33.252s
user    1m47.232s
sys     0m0.826s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata_fast
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m12.468s
user    0m30.834s
sys     0m0.593s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m18.681s
user    0m53.303s
sys     0m0.714s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 0
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.286s
user    0m54.827s
sys     0m0.696s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 1
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m18.088s
user    0m51.650s
sys     0m0.760s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 2
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.176s
user    0m54.583s
sys     0m0.744s
(base) ubuntu@tesseract-ocr-1:~/TEST$ time tesseract 2288.png output  --tessdata-dir ~/tessdata --oem 3
Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-172-gf3cf with Leptonica

real    0m19.216s
user    0m54.951s
sys     0m0.682s

saikalyan9981 commented 3 years ago

@Shreeshrii Thanks a lot, I'll use v5.0.0. I think the issue is with v4.1.1

Ra-Na commented 3 years ago

I just ran

tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

(output of tesseract --version)

on the above image without any issues.

amitdo commented 3 years ago

With the code from #3418, processing finishes after 7 seconds when Sauvola binarization is used, but the output is garbage.
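
For readers unfamiliar with Sauvola binarization: unlike a global method, it computes a separate threshold for each pixel from the mean m and standard deviation s of a local window, T = m * (1 + k * (s / R - 1)). The sketch below is a slow pure-Python illustration for a small 2D list of 8-bit values. The window size and the values k=0.5 and R=128 are assumptions chosen for the example, and this is not the code from #3418.

```python
import math


def sauvola_binarize(img, window=3, k=0.5, R=128.0):
    """Binarize a 2D list of 8-bit gray values with a per-pixel threshold.

    For each pixel, the threshold is m * (1 + k * (s / R - 1)), where m and s
    are the mean and standard deviation of the surrounding window (clipped at
    the image border). Illustrative sketch only.
    """
    h, w = len(img), len(img[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            t = m * (1 + k * (s / R - 1))
            out[y][x] = 255 if img[y][x] > t else 0
    return out
```

A local threshold adapts to uneven backgrounds, but as the comment above shows, finishing quickly does not by itself guarantee usable OCR output.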