tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Tesseract seemingly stuck #3377

Open MerlijnWajer opened 3 years ago

MerlijnWajer commented 3 years ago

Environment

Current Behavior:

Tesseract hangs, seemingly never finishes

Expected Behavior:

Tesseract doesn't hang and produces output normally

GDB backtrace (interrupted after more than 5 minutes):

merlijn@gentoo-x13 ~/archive/tesseract-src/tesseract $ time TESSDATA_PREFIX=/usr/share/tessdata LD_LIBRARY_PATH=`pwd` LD_LIBRARY_PATH=$LD_LIBARY_PATH:`pwd`/.libs gdb --args ./.libs/tesseract /tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm - hocr
GNU gdb (Gentoo 10.1 vanilla) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./.libs/tesseract...
(gdb) r
Starting program: /home/merlijn/archive/tesseract-src/tesseract/.libs/tesseract /tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm - hocr
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-20201231-545-g23ed5' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
Estimating resolution as 246
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7ec49b6 in tesseract::CLIST_ITERATOR::forward (this=this@entry=0x7fffffffbec0)
    at src/ccutil/clst.cpp:265
265   return current->data;
(gdb) bt
#0  0x00007ffff7ec49b6 in tesseract::CLIST_ITERATOR::forward (this=this@entry=0x7fffffffbec0)
    at src/ccutil/clst.cpp:265
#1  0x00007ffff7ec4b8d in tesseract::CLIST::add_sorted (this=<optimized out>,
    comparator=comparator@entry=
    0x7ffff7d86b90 <tesseract::SortByBoxLeft<tesseract::ColPartition>(void const*, void const*)>,
    unique=unique@entry=true, new_data=<optimized out>, new_data@entry=0x5555bf4f66b0)
    at src/ccutil/clst.cpp:176
#2  0x00007ffff7e2cdf7 in tesseract::BBGrid<tesseract::ColPartition, tesseract::ColPartition_CLIST, tesseract::ColPartition_C_IT>::InsertBBox (this=this@entry=0x55555b031130, h_spread=h_spread@entry=true,
    v_spread=v_spread@entry=true, bbox=0x5555bf4f66b0) at src/textord/bbgrid.h:551
#3  0x00007ffff7e3f664 in tesseract::ColPartitionGrid::ComputeTotalOverlap (
    this=this@entry=0x5555555aec68, overlap_grid=overlap_grid@entry=0x7fffffffc158)
    at src/textord/colpartitiongrid.cpp:329
#4  0x00007ffff7e71620 in tesseract::StrokeWidth::DetectAndRemoveNoise (this=0x55555558c420,
    pre_overlap=95268, grid_box=..., block=0x55555558bf20, part_grid=0x5555555aec68,
    diacritic_blobs=0x7fffffffc688) at src/textord/strokewidth.cpp:1350
#5  0x00007ffff7e729da in tesseract::StrokeWidth::FindInitialPartitions (
    this=this@entry=0x55555558c420, pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO,
    rerotation=..., find_problems=find_problems@entry=true, block=block@entry=0x55555558bf20,
    diacritic_blobs=diacritic_blobs@entry=0x7fffffffc688, part_grid=0x5555555aec68,
    big_parts=0x5555555aec98, skew_angle=0x7fffffffc340) at src/textord/strokewidth.cpp:1310
#6  0x00007ffff7e72c08 in tesseract::StrokeWidth::GradeBlobsIntoPartitions (this=0x55555558c420,
    pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO, rerotation=...,
    block=block@entry=0x55555558bf20, nontext_pix=..., denorm=<optimized out>, cjk_script=false,
    projection=0x5555555aecc0, diacritic_blobs=0x7fffffffc688, part_grid=0x5555555aec68,
    big_parts=0x5555555aec98) at src/textord/strokewidth.cpp:379
#7  0x00007ffff7e2be71 in tesseract::ColumnFinder::FindBlocks (this=this@entry=0x5555555aeb30,
    pageseg_mode=pageseg_mode@entry=tesseract::PSM_AUTO, scaled_color=...,
    scaled_factor=<optimized out>, input_block=input_block@entry=0x55555558bf20, photo_mask_pix=...,
    thresholds_pix=..., grey_pix=..., pixa_debug=0x7ffff7c6abd0, blocks=0x7fffffffc5e8,
    diacritic_blobs=0x7fffffffc688, to_blocks=0x7fffffffc690) at src/textord/colfind.cpp:296
#8  0x00007ffff7d5509b in tesseract::Tesseract::AutoPageSeg (this=0x7ffff7c47010,
    pageseg_mode=tesseract::PSM_AUTO, blocks=0x5555555b0c90, to_blocks=0x7fffffffc690,
    diacritic_blobs=0x7fffffffc688, osd_tess=<optimized out>, osr=0x7fffffffca40)
    at src/ccmain/pagesegmain.cpp:226
#9  0x00007ffff7d5555d in tesseract::Tesseract::SegmentPage (this=0x7ffff7c47010,
    input_file=<optimized out>, blocks=0x5555555b0c90, osd_tess=osd_tess@entry=0x0,
    osr=osr@entry=0x7fffffffca40) at src/ccmain/pagesegmain.cpp:140
#10 0x00007ffff7d227bf in tesseract::TessBaseAPI::FindLines (this=0x7fffffffd780)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/include/g++-v9/bits/basic_string.h:2300
#11 0x00007ffff7d24f64 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffd780, monitor=0x0)
    at src/api/baseapi.cpp:838
#12 0x00007ffff7d2552a in tesseract::TessBaseAPI::ProcessPage (this=this@entry=0x7fffffffd780,
    pix=0x5555555b1c50, page_index=page_index@entry=0,
    filename=filename@entry=0x7fffffffdf54 "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0, renderer=
    0x5555555a2810) at src/api/baseapi.cpp:1259
#13 0x00007ffff7d26172 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffd780,
    filename=<optimized out>, retry_config=0x0, timeout_millisec=0, renderer=0x5555555a2810)
    at src/api/baseapi.cpp:1218
#14 0x00007ffff7d2673f in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffffffd780,
    filename=filename@entry=0x7fffffffdf54 "/tmp/sim_new-york-times_1900-01-11_49_15-603_0008.ppm",
    retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=<optimized out>) at src/api/baseapi.cpp:1071
#15 0x0000555555558295 in main (argc=<optimized out>, argv=<optimized out>)
    at src/api/tesseractmain.cpp:783

Image: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008.ppm

egorpugin commented 3 years ago

fyi - 179MB sized image

MerlijnWajer commented 3 years ago

Right, sorry for not mentioning that. I could share the original JPEG2000 image if that is preferred. We process a lot of images at this size (probably around 100,000 at this point) and very few fail this way: at least this one, and potentially two more.

egorpugin commented 3 years ago

It's hard to say where it's stuck or spends most of the time. Probably this could be profiled. Maybe it is just the big image. CLISTs are to be replaced with modern C++ sometime in the future.

stweil commented 3 years ago

Related issue: #3369.

Tesseract shows that behaviour for images where it "detects" a huge number of boxes. Some parts of the layout detection seem to require time which increases with the square of that number.

The critical code finds and inserts into an unordered set.

We sometimes observe images which need more than an hour, too. Maybe the image here is a similar case. I'll run a test to see whether the OCR terminates.
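
The quadratic growth described above can be illustrated with a toy model. The sketch below is not Tesseract's code. It only assumes, based on the `CLIST::add_sorted` and `CLIST_ITERATOR::forward` frames in the backtrace, that each insertion walks a sorted list linearly. The Python names `add_sorted` and `total_comparisons` are illustrative.

```python
def add_sorted(items, value):
    """Insert value into the ascending list `items` by linear scan.

    Returns the number of comparisons needed to find the insert position.
    """
    comparisons = 0
    index = 0
    while index < len(items):
        comparisons += 1
        if items[index] >= value:
            break
        index += 1
    items.insert(index, value)
    return comparisons


def total_comparisons(n):
    """Insert 0..n-1 in ascending order, so every scan runs to the end of the list."""
    items = []
    return sum(add_sorted(items, value) for value in range(n))
```

In this model, doubling the number of boxes roughly quadruples the work: `total_comparisons(1000)` is 499500 and `total_comparisons(2000)` is 1999000.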

MerlijnWajer commented 3 years ago

> We sometimes observe images which need more than an hour, too. Maybe the image here is a similar case. I'll run a test to see whether the OCR terminates.

For this specific image, I believe I've let it run for about a day. There are a few images that precede this one, but they usually take 1.5 minutes, so the rest of the ~24 hours is for this one image. I believe the reason it dies is memory exhaustion, but that is a guess.

Note that this run was not done with latest master, but using the 20201231 snapshot with one additional hOCR patch added.

2021-03-20 14:57:38,367 INFO     Processing pages with Tesseract now.
2021-03-21 14:18:50,772 WARNING  Tesseract failed with stdout: 'Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-10-g1236 with Leptonica\nWarning: Invalid resolution 0 dpi. Using 70 instead.\nEstimating resolution as 246\n'
Traceback (most recent call last):
  File "main.py", line 825, in 
    files = perform_ocr(scandata, img_dir, img_ext, tess_lang, env)
  File "main.py", line 544, in perform_ocr
    output = check_output(['tesseract',
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['tesseract', '-l', 'eng', '-c', 'tessedit_create_txt=1', '-c', 'tessedit_create_hocr=1', '-c', 'hocr_char_boxes=1', '-c', 'hocr_font_info=1', '/tmp/sim_new-york-times_1900-01-11_49_15-603_jp2/sim_new-york-times_1900-01-11_49_15-603_0008.jp2', '/tmp/sim_new-york-times_1900-01-11_49_15-603_jp2/sim_new-york-times_1900-01-11_49_15-603_0008']' died with .
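
A batch job like the one in the traceback can protect itself against a single stuck page by giving the child process a wall-clock limit. The following is a minimal sketch using only Python's standard `subprocess` module. The helper name `run_with_timeout` is hypothetical and not part of the `main.py` shown above. Note that a timeout bounds run time only; it does not limit memory use.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return its stdout as bytes, or None if it ran too long.

    A non-zero exit status still raises CalledProcessError, as in the
    traceback above.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None
    return completed.stdout
```

A caller could pass the same `['tesseract', ...]` argument list as in the traceback and log the page as skipped whenever `None` comes back.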

MerlijnWajer commented 3 years ago

I am not sure if it is helpful, but I could surface the other images that have similar problems.

stweil commented 3 years ago

You can use those to test a fix (as soon as we have one), but I don't need more images for this issue.

My first test was killed by the Linux kernel after 75 minutes because Tesseract's memory usage increased continuously to more than 6 GiB (I had no swap space provided, and running three similar processes was simply too much for 16 GiB RAM). So the image here not only consumes much time (I still think OCR will finish finally) but also much memory. Maybe in your case the OCR was also stopped because of out-of-memory. Running dmesg will show whether the kernel killed a tesseract process.

A second test ran for 5 hours before it was killed again, at about 10 GB of RAM:

[263904.602999] Out of memory: Killed process 294046 (tesseract) total-vm:10983512kB, anon-rss:10017884kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:21484kB oom_score_adj:0
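
Checking `dmesg` output by hand does not scale to a large batch, so a small parser may help. The sketch below extracts the process name and resident memory from a line shaped like the one quoted above. The pattern is derived from that single line only; the kernel's wording can differ between versions, so treat the regular expression as an assumption.

```python
import re

# Matches lines such as:
# "... Out of memory: Killed process 294046 (tesseract) total-vm:...kB, anon-rss:...kB, ..."
OOM_PATTERN = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, process name and anonymous RSS in GiB, or None if no match."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "anon_rss_gib": int(match.group("anon_rss_kb")) / (1024 * 1024),
    }
```

For the line quoted above this reports the process `tesseract` with roughly 9.55 GiB of anonymous memory, which matches the "about 10 GB" in the comment.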

egorpugin commented 3 years ago

@stweil I've removed that custom hasher. I do not think that this will increase performance, but still we can check it. https://github.com/tesseract-ocr/tesseract/commit/50aec308b3d66c1b669ceb9160fd96000c250f6a

stweil commented 3 years ago

I already tried that, and it does not change the performance. A simplified custom hash function (without the division) also had no effect on the performance. I also tried using a sorted set instead of the unsorted one. That slightly increased the execution time.

amitdo commented 3 years ago

> CLISTs are to be replaced with modern C++ sometime in the future.

https://github.com/jimregan/tesseract-wiki-mine/blob/master/TesseractProjects.wiki#things-i-would-not-recommend-working-on

> Someone suggested the macro-based list stuff as a candidate for replacement with stl... Compared to the macro-based lists in tesseract, stl lists are very different, very incompatible, and IMHO a poor abstraction designed to make them as like vectors as possible, and if you use them the way they are used in tesseract, it would be very slow... It might be possible to sensibly convert the macro-based lists to (mostly) use templates though.

egorpugin commented 3 years ago

This is from 2008.

stweil commented 3 years ago

The Tesseract OCR terminates after running several days and using 16 GB or more RAM with a surprising result:

Tesseract Open Source OCR Engine v5.0.0-alpha-20210401-2-g1c50 with Leptonica
Estimating resolution as 246
Detected 28905 diacritics
Empty page!!
Estimating resolution as 246
Detected 28909 diacritics
Empty page!!

See also issue #3021 which reports full newspaper pages where Tesseract does not detect any text.

amitdo commented 3 years ago

> This is from 2008.

It says that using std::list instead of the (intrusive) C lists will result in much slower code.

Ignore the part that rules out any use of the STL, which is outdated.

stweil commented 3 years ago

The Tesseract lists (CLIST, ELIST, ELIST2) are cyclic lists and use a very special construct for list iterations. That makes switching to STL lists difficult. At least the standard STL method size is much more performant with recent C++17 than the equivalent Tesseract implementation length, which counts the list elements by iterating over the whole list.
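
The cost difference between counting by iteration and keeping a stored size can be shown with a toy cyclic list. This is a sketch in Python, not a port of CLIST: the class and method names are hypothetical, and the only assumption borrowed from the comment above is that the list is cyclic and that its length is found by walking every element.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = self  # a lone node links to itself, closing the cycle


class CyclicList:
    """Singly linked cyclic list that remembers only its last node."""

    def __init__(self):
        self.last = None
        self.count = 0  # stored size, updated on every append

    def append(self, data):
        node = Node(data)
        if self.last is not None:
            node.next = self.last.next  # new node points at the first node
            self.last.next = node
        self.last = node
        self.count += 1

    def length_by_walking(self):
        """O(n): visit every node once, starting from the first."""
        if self.last is None:
            return 0
        steps = 1
        node = self.last.next
        while node is not self.last:
            steps += 1
            node = node.next
        return steps
```

Both `length_by_walking()` and `count` give the same answer, but the first touches every node on each call while the second is a single attribute read.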

egorpugin commented 3 years ago

First thing is to replace those list macros with templates.

MerlijnWajer commented 3 years ago

I found that using Sauvola thresholding solves the problem for this image - it's possible that the Otsu thresholding just makes such a mess of the image that the segmenter has tremendous trouble interpreting the image.

You can find the thresholded image here: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008_thresholded.png (1.6MB). The plaintext is here: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008_thresholded.txt (68K).

The runtime on my machine (Tesseract 4, stable) was just under four minutes:

real    3m58.898s
user    3m58.715s
sys 0m0.117s

I've ported a low-memory and fast Sauvola thresholding algorithm from this paper: https://arxiv.org/pdf/1905.13038.pdf and will start looking into making it possible for Tesseract to use that thresholding instead (per #3083 ). So perhaps once selectable binarisation is in place, this issue can be resolved.
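
For readers unfamiliar with Sauvola thresholding, here is a deliberately naive sketch of the commonly cited formula, where each pixel is compared against `mean * (1 + k * (std / R - 1))` computed over a local window. It is neither the fast low-memory algorithm from the linked paper nor Leptonica's implementation, and the parameter values (`radius=1`, `k=0.2`, `r=128.0`) are assumptions chosen for illustration. Unlike a single global Otsu threshold, the decision adapts to local brightness and contrast.

```python
import math


def sauvola_binarize(image, radius=1, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0..255 values.

    Returns 0 for pixels at or below the local threshold (ink) and 1 otherwise.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the window around (x, y), clipped at the image borders.
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            variance = sum((value - mean) ** 2 for value in window) / len(window)
            threshold = mean * (1.0 + k * (math.sqrt(variance) / r - 1.0))
            row.append(0 if image[y][x] <= threshold else 1)
        result.append(row)
    return result
```

On a 3x3 patch of value 200 with a single pixel of value 50 in the middle, only the middle pixel is classified as ink. Recomputing the window for every pixel is slow on large scans, which is why practical implementations use faster schemes.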

zdenop commented 3 years ago

Did you try pixSauvolaBinarize from leptonica?

MerlijnWajer commented 3 years ago

> Did you try pixSauvolaBinarize from leptonica?

Yes, I have experimented with that method too, but the binarisation step uses more RAM (3.3GB vs 660MB). Tesseract finished in about 5-6 minutes using the Leptonica Sauvola binarised image, depending on the Sauvola parameters, of course.

MerlijnWajer commented 3 years ago

To be clear, my experiments are running Tesseract on an already binarised image (either made using the code I mentioned above, or using Leptonica's Sauvola binarisation). I know that is not ultimately how people should run Tesseract (on a binarised image for OCR quality purposes), but for the purpose of testing if it fixes this bug, it was easier.

I suspect that adding alternative binarisation to Tesseract (e.g. the leptonica binarise, or the one I wrote based on the paper) will also solve this problem on a non-binarised version of this image.

zdenop commented 3 years ago

> I know that is not ultimately how people should run Tesseract (on a binarised image for OCR quality purposes

IMO this is exactly how tesseract should be run. The problem is that most users want to OCR colourful images and do not care about binarization, so tesseract provides Otsu, which should work in most cases... And if you use a binarized image, you can set tessedit_do_invert to false ("-c tessedit_do_invert=0") to gain extra speed.
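
In a pipeline that calls the command-line tool, the option above can be added only for inputs that are already binarized. A small sketch follows. The helper name and its arguments are hypothetical; the option string is the one quoted in this comment, placed before the image path as in the command shown in the earlier traceback.

```python
def tesseract_command(image_path, output_base, already_binarized=False):
    """Build an argument list for the tesseract CLI."""
    command = ["tesseract"]
    if already_binarized:
        # Skip the inversion check, as suggested for pre-binarized input.
        command += ["-c", "tessedit_do_invert=0"]
    command += [image_path, output_base]
    return command
```

For example, `tesseract_command("page.png", "page", already_binarized=True)` yields `["tesseract", "-c", "tessedit_do_invert=0", "page.png", "page"]`.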

MerlijnWajer commented 3 years ago

> > I know that is not ultimately how people should run Tesseract (on a binarised image for OCR quality purposes
>
> IMO this is exactly how tesseract should be run. The problem is that most users want to OCR colourful images and do not care about binarization, so tesseract provides Otsu, which should work in most cases... And if you use a binarized image, you can set tessedit_do_invert to false ("-c tessedit_do_invert=0") to gain extra speed.

Understood, thanks. I remember reading (I don't know where) that the LSTM engine would potentially work better on grayscale images than binarised images. I'll look into adding Sauvola binarisation using leptonica's method to Tesseract, and then see if that opens up ways to add other binarisation methods.

amitdo commented 3 years ago

Leptonica has other binarization methods.

http://www.cvc.uab.es/icdar2009/papers/3725b375.pdf

ICDAR 2009 Document Image Binarization Contest (DIBCO 2009)

33) Google, Inc., Mountain View, USA (D. Bloomberg):
a. Image binarization using a local background normalization, followed by a global threshold.
b. Image binarization using a local background normalization, followed by a modified Otsu approach to get a global threshold that can be applied to the normalized image.
c. Image binarization using a local background normalization with two different thresholds. For the part of the image near the text, a high threshold can be chosen, to render the text fully in black. For the rest of the image, much of which is background, use a threshold based on the Otsu global value for the original image.

33c - 7th place, 33b - 11th place

MerlijnWajer commented 3 years ago

Cool - seems worth checking out when working on adding Sauvola. I went with Sauvola after experimenting with (and evaluating) all the thresholding algorithms present in scikit-image (https://scikit-image.org/docs/dev/api/skimage.filters.html), in particular because of this note (and the paper): "This algorithm is originally designed for text recognition." I didn't evaluate the methods for the purpose of OCRing, though, but rather for the purpose of creating masks of the text (and lines in photos/images) for MRC compression.

amitdo commented 3 years ago

More methods with open source implementations:

https://github.com/opencv/opencv_contrib/blob/4a36e77dba0f8f2/modules/ximgproc/include/opencv2/ximgproc.hpp

https://github.com/ocropus/ocropy/blob/master/ocropus-gpageseg
https://github.com/ocropus/ocropy/wiki/Publications#binarization

https://github.com/brandonmpetty/Doxa

zdenop commented 3 years ago

Gamera (a Python framework for building document analysis applications) also has a bunch of binarization implementations.

ImageJ (a Java image processing program designed for scientific multidimensional images) has an Auto Threshold plugin with several other methods.

Both projects use the GPL3 licence, so we cannot copy and paste from them.

amitdo commented 3 years ago

With the code from #3418, the processing ends after 4:30 minutes, when Sauvola binarization is used. The output looks good.

Note that the image size is equivalent to 7 A4 pages, so the processing time is 38 seconds per page.
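
The page estimate can be reproduced with rough arithmetic. The sketch below takes the "179MB" figure mentioned earlier in the thread and assumes decimal megabytes, an 8-bit RGB PPM at 3 bytes per pixel, and A4 at 300 dpi (2480 x 3508 pixels). None of these assumptions is stated in the comment; the resolution in particular is a guess that happens to reproduce "about 7 pages".

```python
PPM_BYTES = 179 * 1000 * 1000         # "179MB", read as decimal megabytes (assumption)
BYTES_PER_PIXEL = 3                   # 8-bit RGB PPM, header ignored (assumption)
A4_PIXELS_AT_300_DPI = 2480 * 3508    # A4 at 300 dpi (assumption; dpi not stated)

pixels = PPM_BYTES / BYTES_PER_PIXEL
pages = pixels / A4_PIXELS_AT_300_DPI

# 4:30 minutes spread over 7 pages, as in the comment above.
seconds_per_page = (4 * 60 + 30) / 7

print(f"{pixels / 1e6:.1f} megapixels, about {pages:.2f} A4 pages")
print(f"{seconds_per_page:.1f} seconds per page")
```

Under these assumptions the image holds just under 60 megapixels, a little under 7 A4 pages, and the per-page time comes out between 38 and 39 seconds.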

With adaptive Otsu I get 'Empty page!' after 36 seconds.

amitdo commented 3 years ago

The legacy Otsu is done on a full color image (not grayscale) and without tiles. This leads to excessive memory consumption on large images.

We need to limit the maximum image size in pixels (to 12M?) that the legacy Otsu is allowed to handle. For larger images, it should fall back to LeptonicaOtsu (with tile_size=2.0?).
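
The proposed rule could look like the following sketch. It is an illustration of the idea only, not code from Tesseract: the function name is hypothetical, the returned strings are plain labels rather than confirmed identifiers, and the 12M limit is the tentative value from the comment above.

```python
MAX_LEGACY_OTSU_PIXELS = 12_000_000  # "12M?" in the proposal; tentative


def choose_otsu_variant(width, height, max_pixels=MAX_LEGACY_OTSU_PIXELS):
    """Pick the legacy Otsu for small images and the tiled variant otherwise."""
    if width * height <= max_pixels:
        return "LegacyOtsu"
    return "LeptonicaOtsu"
```

With this rule a 3000 x 4000 scan would still use the legacy method, while a newspaper page of roughly 60 megapixels like the one in this issue would take the fallback path.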

Lambdac0re commented 2 years ago

Here is another image which absolutely wrecks Tesseract: https://i.imgur.com/0J8Ew.gif It also has lots of boxes like @stweil mentioned...