MerlijnWajer opened 3 years ago
FYI - this is a 179 MB image.
Right, sorry for not mentioning that. I could share the original JPEG2000 image if that is preferred. We process a lot of images at this size (probably 100,000 at this point), and very few fail this way (at least this one, potentially two more).
It's hard to say where it's stuck or where it spends most of the time. This could probably be profiled. Maybe it is just the big image. CLISTs are to be replaced with modern C++ somewhere in the future.
Related issue: #3369.
Tesseract shows that behaviour for images where it "detects" a huge number of boxes. Some parts of the layout detection seem to require time which increases with the square of that number.
The critical code finds and inserts into an unordered set.
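For illustration only - the names below are invented, not the actual Tesseract code - the pattern under discussion looks roughly like this. Each find/insert into a std::unordered_set is amortised O(1), so if changing the hash makes no difference (see the experiments below), the quadratic growth presumably comes from the enclosing pairwise loop over the detected boxes rather than from the set operations:

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: deduplicating the results of a pairwise box
// comparison through an unordered set. Names and logic are illustrative.
struct Box {
  int left, top, right, bottom;
  bool operator==(const Box &o) const {
    return left == o.left && top == o.top && right == o.right &&
           bottom == o.bottom;
  }
};

struct BoxHash {
  std::size_t operator()(const Box &b) const {
    std::size_t h = static_cast<std::size_t>(b.left);
    h = h * 31 + static_cast<std::size_t>(b.top);
    h = h * 31 + static_cast<std::size_t>(b.right);
    h = h * 31 + static_cast<std::size_t>(b.bottom);
    return h;
  }
};

std::size_t CountDistinctPartners(const std::vector<Box> &boxes) {
  std::unordered_set<Box, BoxHash> seen;
  for (const Box &a : boxes) {          // O(n)
    for (const Box &b : boxes) {        // x O(n) -> O(n^2) overall
      if (&a == &b) continue;
      if (seen.find(b) == seen.end()) { // amortised O(1) per operation
        seen.insert(b);
      }
    }
  }
  return seen.size();
}
```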
We observe sometimes images which need more than an hour, too. Maybe the image here is a similar case. I'll run a test to see whether the OCR terminates.
For this specific image, I believe I let it run for about a day. There are a few images that precede this one, but they usually take 1.5 minutes, so the rest of the ~24 hours went to this one image. I believe the reason it dies is memory exhaustion - but that is a guess.
Note that this run was not done with latest master, but using the 20201231 snapshot with one additional hOCR patch added.
2021-03-20 14:57:38,367 INFO Processing pages with Tesseract now.
2021-03-21 14:18:50,772 WARNING Tesseract failed with stdout: 'Tesseract Open Source OCR Engine v5.0.0-alpha-20201231-10-g1236 with Leptonica\nWarning: Invalid resolution 0 dpi. Using 70 instead.\nEstimating resolution as 246\n'
Traceback (most recent call last):
  File "main.py", line 825, in <module>
    files = perform_ocr(scandata, img_dir, img_ext, tess_lang, env)
  File "main.py", line 544, in perform_ocr
    output = check_output(['tesseract',
  File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['tesseract', '-l', 'eng', '-c', 'tessedit_create_txt=1', '-c', 'tessedit_create_hocr=1', '-c', 'hocr_char_boxes=1', '-c', 'hocr_font_info=1', '/tmp/sim_new-york-times_1900-01-11_49_15-603_jp2/sim_new-york-times_1900-01-11_49_15-603_0008.jp2', '/tmp/sim_new-york-times_1900-01-11_49_15-603_jp2/sim_new-york-times_1900-01-11_49_15-603_0008']' died with .
I am not sure if it is helpful, but I could surface the other images that have similar problems.
You can use those to test a fix (as soon as we have one), but I don't need more images for this issue.
My first test was killed by the Linux kernel after 75 minutes because Tesseract's memory usage increased continuously to more than 6 GiB (I had no swap space configured, and running three similar processes was simply too much for 16 GiB RAM). So the image here not only consumes much time (I still think the OCR will eventually finish) but also much memory. Maybe in your case the OCR was also stopped because of out-of-memory; running dmesg will show whether the kernel killed a tesseract process.
A second test ran for 5 hours before it was again killed, using about 10 GB RAM:
[263904.602999] Out of memory: Killed process 294046 (tesseract) total-vm:10983512kB, anon-rss:10017884kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:21484kB oom_score_adj:0
@stweil I've removed that custom hasher. I do not think that this will increase performance, but we can still check it. https://github.com/tesseract-ocr/tesseract/commit/50aec308b3d66c1b669ceb9160fd96000c250f6a
I already tried that, and it does not change the performance. A simplified custom hash function (without the division) also had no effect on the performance. I also tried using a sorted set instead of the unordered one; that slightly increased the execution time.
CLISTs are to be replaced with modern C++ somewhere in the future.
Someone suggested the macro-based list stuff as a candidate for replacement with STL... Compared to the macro-based lists in tesseract, STL lists are very different, very incompatible, and IMHO a poor abstraction designed to make them as vector-like as possible, and if you use them the way they are used in tesseract, it would be very slow... It might be possible to sensibly convert the macro-based lists to (mostly) use templates though.
This is from 2008.
The Tesseract OCR terminates after running several days and using 16 GB or more of RAM, with a surprising result:
Tesseract Open Source OCR Engine v5.0.0-alpha-20210401-2-g1c50 with Leptonica
Estimating resolution as 246
Detected 28905 diacritics
Empty page!!
Estimating resolution as 246
Detected 28909 diacritics
Empty page!!
See also issue #3021 which reports full newspaper pages where Tesseract does not detect any text.
This is from 2008.
It says that using std::list instead of the (intrusive) C lists will result in much slower code.
Ignore the part that rules out any use of the STL, which is outdated.
The Tesseract lists (CLIST, ELIST, ELIST2) are cyclic lists and use a very special construct for list iterations. That makes switching to STL lists difficult. At least the standard STL method size() is much more performant with recent C++17 than the equivalent Tesseract implementation length(), which counts the list elements by iterating over the whole list.
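To make that concrete, a minimal standalone illustration (not the Tesseract code): since C++11, std::list::size() is required to run in constant time, while a length() in the Tesseract style walks the entire list.

```cpp
#include <cassert>
#include <list>

// Counting by iteration, as Tesseract's length() does: O(n).
template <typename T>
int length_by_iteration(const std::list<T> &l) {
  int count = 0;
  for (auto it = l.begin(); it != l.end(); ++it) {
    ++count;
  }
  return count;
}

int main() {
  std::list<int> l{1, 2, 3, 4};
  assert(l.size() == 4);                // O(1) since C++11
  assert(length_by_iteration(l) == 4);  // O(n)
}
```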
The first thing is to replace those list macros with templates.
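As a starting point, here is a minimal sketch of what such a template could look like - illustrative only, not a drop-in replacement for Tesseract's CLISTIZE/ELISTIZE machinery - keeping the cyclic structure that the macros provide:

```cpp
// A cyclic singly linked list that stores a pointer to its last node,
// so that both the head (last_->next) and the tail are reachable in O(1).
template <typename T>
class CyclicList {
  struct Node {
    T data;
    Node *next;
  };
  Node *last_ = nullptr;  // last_->next is the head of the cycle

 public:
  ~CyclicList() { clear(); }

  bool empty() const { return last_ == nullptr; }

  void push_back(const T &value) {
    Node *n = new Node{value, nullptr};
    if (last_ == nullptr) {
      n->next = n;  // a single element points at itself
    } else {
      n->next = last_->next;
      last_->next = n;
    }
    last_ = n;
  }

  // Like Tesseract's length(): counts by walking the whole cycle, O(n).
  int length() const {
    if (empty()) return 0;
    int count = 0;
    Node *n = last_;
    do {
      ++count;
      n = n->next;
    } while (n != last_);
    return count;
  }

  void clear() {
    if (empty()) return;
    Node *head = last_->next;
    last_->next = nullptr;  // break the cycle, then delete linearly
    while (head != nullptr) {
      Node *next = head->next;
      delete head;
      head = next;
    }
    last_ = nullptr;
  }
};
```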
I found that using Sauvola thresholding solves the problem for this image - it's possible that the Otsu thresholding just makes such a mess of the image that the segmenter has tremendous trouble interpreting it.
You can find the thresholded image here: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008_thresholded.png (1.6MB). The plain text is here: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008_thresholded.txt (68K).
The runtime on my machine (Tesseract 4, stable) was just under four minutes:
real 3m58.898s
user 3m58.715s
sys 0m0.117s
I've ported a low-memory and fast Sauvola thresholding algorithm from this paper: https://arxiv.org/pdf/1905.13038.pdf and will start looking into making it possible for Tesseract to use that thresholding instead (per #3083 ). So perhaps once selectable binarisation is in place, this issue can be resolved.
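For reference, the Sauvola rule these implementations are based on (this is the standard formula from the literature; the internals of the port above are not shown here) derives a per-pixel threshold from local window statistics:

```cpp
// Standard Sauvola threshold: T = m * (1 + k * (s / R - 1)), where m and s
// are the mean and standard deviation of a local window around the pixel,
// k is a sensitivity parameter (typically 0.2-0.5) and R is the dynamic
// range of the standard deviation (128 for 8-bit images). A pixel whose
// value falls below T is set to black.
double sauvola_threshold(double mean, double stddev, double k = 0.2,
                         double R = 128.0) {
  return mean * (1.0 + k * (stddev / R - 1.0));
}
```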
Did you try pixSauvolaBinarize from Leptonica?
Yes, I have experimented with that method too, but the binarisation step uses more RAM (3.3 GB vs 660 MB). Tesseract finished in about 5-6 minutes using the Leptonica Sauvola binarised image -- depending on the Sauvola parameters, of course.
To be clear, my experiments run Tesseract on an already binarised image (made either with the code I mentioned above or with the Leptonica Sauvola binarise). I know that is not ultimately how people should run Tesseract (on a binarised image, for OCR quality purposes), but for the purpose of testing whether it fixes this bug, it was easier.
I suspect that adding alternative binarisation to Tesseract (e.g. the leptonica binarise, or the one I wrote based on the paper) will also solve this problem on a non-binarised version of this image.
I know that is not ultimately how people should run Tesseract (on a binarised image, for OCR quality purposes)
IMO this is exactly how Tesseract should be run. The problem is that most users want to OCR colour images and do not care about binarization, so Tesseract provides Otsu, which should work in most cases... And if you use a binarized image, set tessedit_do_invert to false ("-c tessedit_do_invert=0") to gain extra speed.
Understood, thanks. I remember reading (I don't know where) that the LSTM engine would potentially work better on grayscale images than on binarised ones. I'll look into adding Sauvola binarisation to Tesseract using Leptonica's method, and then see if that opens up ways to add other binarisation methods.
Leptonica has other binarization methods.
http://www.cvc.uab.es/icdar2009/papers/3725b375.pdf
ICDAR 2009 Document Image Binarization Contest (DIBCO 2009)
33) Google, Inc., Mountain View, USA (D. Bloomberg):
a. Image binarization using a local background normalization, followed by a global threshold.
b. Image binarization using a local background normalization, followed by a modified Otsu approach to get a global threshold that can be applied to the normalized image.
c. Image binarization using a local background normalization with two different thresholds. For the part of the image near the text, a high threshold can be chosen, to render the text fully in black. For the rest of the image, much of which is background, use a threshold based on the Otsu global value for the original image.
33c - 7th place, 33b - 11th place
Cool - seems worth checking out when working on adding Sauvola. I went with Sauvola after experimenting with (and evaluating) all the thresholding algorithms in scikit-image (https://scikit-image.org/docs/dev/api/skimage.filters.html), in particular because of this note (and the paper): "This algorithm is originally designed for text recognition." I didn't evaluate the methods for OCR purposes, though, but rather for creating masks of the text (and of lines in photos/images) for MRC compression.
More methods with open source implementations:
https://github.com/ocropus/ocropy/blob/master/ocropus-gpageseg https://github.com/ocropus/ocropy/wiki/Publications#binarization
Gamera (a Python framework for building document analysis applications) also has a bunch of binarization implementations.
ImageJ (a Java image processing program designed for scientific multidimensional images) has an Auto Threshold plugin with several other methods.
Both projects use the GPL-3 licence, so we cannot copy & paste.
With the code from #3418, the processing ends after 4:30 minutes when Sauvola binarization is used. The output looks good.
Note that the image size is equivalent to 7 A4 pages, so the processing time (270 s / 7) is about 38 seconds per page.
With adaptive Otsu I get 'Empty page!' after 36 seconds.
The legacy Otsu is done on a full color image (not grayscale) and without tiles. This will lead to excessive memory consumption on large images.
We need to limit the maximum image size in pixels (to 12M?) that the legacy Otsu is allowed to handle. For larger images, it should fall back to LeptonicaOtsu (with tile_size=2.0?).
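A hedged sketch of what that guard could look like (the enum and function names here are illustrative, not the actual Tesseract API):

```cpp
#include <cstdint>

enum class ThresholdMethod { LegacyOtsu, LeptonicaOtsu };

// Proposed cap from the comment above: limit the legacy Otsu to ~12M
// pixels and fall back to the tiled Leptonica Otsu for anything larger.
constexpr int64_t kMaxLegacyOtsuPixels = 12000000;

ThresholdMethod ChooseThresholder(int width, int height) {
  const int64_t pixels = static_cast<int64_t>(width) * height;
  return pixels > kMaxLegacyOtsuPixels ? ThresholdMethod::LeptonicaOtsu
                                       : ThresholdMethod::LegacyOtsu;
}
```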
Here is another image which absolutely wrecks Tesseract: https://i.imgur.com/0J8Ew.gif. It also has lots of boxes, like @stweil mentioned...
Environment
master
23ed59bd7bca777e4e104c4ee540843373aa9869
Linux gentoo-x13 5.11.7-gentoo-dist #1 SMP Wed Mar 17 21:03:41 -00 2021 x86_64 AMD Ryzen 7 PRO 4750U with Radeon Graphics AuthenticAMD GNU/Linux
Current Behavior:
Tesseract hangs, seemingly never finishes
Expected Behavior:
Tesseract doesn't hang and produces output normally
GDB backtrace (interrupted after more than 5 minutes):
Image: https://archive.org/~merlijn/tesseract-images/sim_new-york-times_1900-01-11_49_15-603_0008.ppm