tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

tesseract process never finishes with specific gif image #3369

Open wix-andriusb opened 3 years ago

wix-andriusb commented 3 years ago

Environment

tesseract 4.1.1

Reproduced on macOS and Linux.

uname -a
Darwin VL-C02WL1AYHTD6 19.6.0 Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64 x86_64
Linux ocr-5b7bf86f6-f6qsd 5.4.65-wix #1 SMP Thu Nov 19 15:24:12 UTC 2020 x86_64 GNU/Linux

Current Behavior:

Running tesseract on the command line on this image, https://bentkus.eu/ocr_while_true.gif, does not finish even after 1 hour.

tesseract ocr_while_true.gif ocr_while_true --dpi 150

Expected Behavior:

process should finish in 2 minutes

Suggested Fix:

I'll try to build and see why it never stops

upd. (by @egorpugin): test png - https://bentkus.eu/ocr_while_loop.png

stweil commented 3 years ago

That GIF file is special: it includes 125 images. How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?

amitdo commented 3 years ago

This is a gif animation.

Convert it to static images and give them to Tesseract as input.

stweil commented 3 years ago

Other issues where OCR never finishes: #2196, #2288.

stweil commented 3 years ago

> Convert it to static images [...]

The static images work fine. Nevertheless handling of animated GIF images has to be well defined, see my question above.

wix-andriusb commented 3 years ago

OK, thanks for the advice. I should handle this on my side: check for GIF input, split it into frames, and analyze each one. Other GIFs I tried finished, so I assumed this one would work too.

wix-andriusb commented 3 years ago

Can you maybe tell me what tool you used to create the static images?

egorpugin commented 3 years ago

> How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?

My answer is 'Create OCR for all images'

stweil commented 3 years ago

> Can you maybe tell me what tool you used to create the static images?

I used convert FROM.gif TO.png.

wix-andriusb commented 3 years ago

> My answer is 'Create OCR for all images'

This could be configurable through arguments, with the default being to run OCR on all images.

egorpugin commented 3 years ago

It should be something like cat 1.txt 2.txt 3.txt ... When we pass multiple images, they all must be processed. The same goes for multipage images (.gif, .tiff) if such a format is enabled in Leptonica.
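A minimal Python sketch of that "cat" behaviour, assuming the frames have already been split into separate image files. The helper names `run_tesseract` and `ocr_all` are hypothetical; the only real interface used is the `tesseract IMAGE stdout` command-line form, which prints the recognized text instead of writing a file.

```python
import subprocess


def run_tesseract(image_path):
    """OCR one image with the tesseract CLI and return its text output."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout


def ocr_all(image_paths, ocr_page=run_tesseract):
    """Concatenate per-page OCR text in input order, like `cat 1.txt 2.txt ...`.

    `ocr_page` is injectable so the concatenation can be exercised
    without a tesseract installation.
    """
    return "".join(ocr_page(path) for path in image_paths)
```

For example, `ocr_all(["frame_000.png", "frame_001.png"])` would return the text of both frames back to back (the file names are placeholders).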

stweil commented 3 years ago

So the handling of animated GIF should be similar to multipage TIFF (which either processes all pages or a selected page as far as I remember).

Maybe in a first step throwing an "unimplemented" error is easier. I am not sure how Leptonica supports animated GIF.

stweil commented 3 years ago

> The static images work fine.

I was mistaken. Not all static images work fine. The first one, which looks empty, requires more than 4 minutes (I originally wrote that it does not terminate).

amitdo commented 3 years ago

We depend on Leptonica for image IO. Can it handle gif animation? @DanBloomberg

What we need from Leptonica:

This way we can treat it like we treat multi-page tiff.

wix-andriusb commented 3 years ago

this image is the offender, a blank page with specific color

amitdo commented 3 years ago

> The first one which looks empty does not terminate.

So Leptonica probably only sees the first image and returns it as pix.

amitdo commented 3 years ago

Please attach the first image.

egorpugin commented 3 years ago

I've recorded the first N GBs of debug logs in the infinite loop.

Smooothing part at:Bounding box=(-1888,1064)->(-1884,1067)
Smooothing part at:Bounding box=(-1886,1066)->(-1882,1069)
Smooothing part at:Bounding box=(-1884,1067)->(-1879,1071)
Smooothing part at:Bounding box=(-1874,1074)->(-1870,1077)
Smooothing part at:Bounding box=(-1825,1070)->(-1814,1080)
Smooothing part at:Bounding box=(-1788,1071)->(-1784,1074)

Is it tess specific thing or a bug? negative numbers in bbox

stweil commented 3 years ago

I have now run the latest Tesseract production code on the original animated GIF image. The image is processed, and Tesseract returns a "result" for the first included image. This takes 4:26 minutes, so it does finish, but that is rather long for an image which looks empty to me but obviously includes lots of small colour variations (otherwise the PNG file would be much smaller).

stweil commented 3 years ago

@wix-andriusb, how long did you wait for "never finished"? Depending on your machine, it might take at least 4 minutes, but maybe also 20 minutes. Of course this can nevertheless be considered as a bug.

stweil commented 3 years ago

> Is it tess specific thing or a bug? negative numbers in bbox

The original image is 1080 x 1920, so those box coordinates look definitely strange, not only because they are negative, but also because the absolute x values exceed the image width.

egorpugin commented 3 years ago

We can try to cut the image to, let's say, 50x50 and check it.

egorpugin commented 3 years ago

Is it possible to implement faster pixel counting?

image

wix-andriusb commented 3 years ago

Locally (macOS) I have 4.1.1 installed with brew; it has been running for more than 8 minutes now.

4.00 on a Linux server was running for hours before it got killed.

image

egorpugin commented 3 years ago

For 50x100 reduced image -

Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)
Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)

Full log of that loop - 1.txt

DanBloomberg commented 3 years ago

Referring to Amit's comment: I attempted to implement writing of GIF animations about 4 years ago, but failed. I left questions for the GIF inventor/maintainer, but he did not engage. So I implemented writing of WebP animations instead.

Never tried reading animated gif into a pixa.

DanBloomberg commented 3 years ago

And if someone shows me how to tell if a gif file is an animated gif, I'll use it in the gif reader to skip ("not supported") reading. I believe that would mostly solve this issue.
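One possible way to tell, sketched in Python rather than in Leptonica's C: walk the GIF block structure and count image descriptors (block type 0x2C). More than one descriptor is treated here as "animated", which is a heuristic and an assumption of this sketch, not something defined by Leptonica. The function names are hypothetical, and truncated input simply raises IndexError.

```python
def _skip_sub_blocks(data, pos):
    """Skip a chain of GIF data sub-blocks; return the index after the 0 terminator."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:
            return pos
        pos += size


def count_gif_frames(data):
    """Count image descriptors in the bytes of a GIF file."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    packed = data[10]                  # flags byte of the logical screen descriptor
    pos = 13                           # 6-byte header + 7-byte screen descriptor
    if packed & 0x80:                  # global color table present
        pos += 3 * (2 << (packed & 0x07))
    frames = 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:              # trailer: end of file
            break
        if block == 0x2C:              # image descriptor: one frame
            frames += 1
            packed = data[pos + 8]     # flags follow left, top, width, height
            pos += 9
            if packed & 0x80:          # local color table present
                pos += 3 * (2 << (packed & 0x07))
            pos += 1                   # LZW minimum code size
        elif block == 0x21:            # extension block
            pos += 1                   # extension label
        else:
            raise ValueError("unexpected block type 0x%02x" % block)
        pos = _skip_sub_blocks(data, pos)
    return frames


def is_animated_gif(data):
    """Heuristic: a GIF with more than one image is treated as animated."""
    return count_gif_frames(data) > 1
```

A reader could call `is_animated_gif(open(path, "rb").read())` and refuse or special-case the file before decoding it.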

stweil commented 3 years ago

I don't think there is much demand for advanced OCR support for animated GIF files. That's a very special, rare need. Obviously the first image in an animated GIF is already read and processed with the current code. Processing all images in a file can be done with a simple external conversion.

So the animated GIF issue has very low priority for me.

The huge time which is required to process an image without visible content is more important for me, as I expect that "normal" scans with text can suffer from extended processing time, too. And OCR processing time has high priority.

egorpugin commented 3 years ago

Funny, I tried to optimize the hot path using pixCountPixelsInRect instead of pixCountPixels. I thought it wouldn't create a new pix, but it does exactly the same as the commented-out code on the left side.

image

DanBloomberg commented 3 years ago

Both pixRasterop() and pixCountPixels() are optimized, so using them together -- first cropping the rectangle with rasterop and then counting the ON pixels -- is very efficient.

egorpugin commented 3 years ago

But is it possible to count pixels directly on the original pix?

DanBloomberg commented 3 years ago

Yes, of course, but it would be a bit complicated to do it efficiently. The 1 bpp image has 32 pixels in each word. Each raster line in general would have a partial word at the beginning, a series of complete words, and a partial word at the end. And the first partial word might be the only one that has any pixels, so you have to worry about that case as well. You would need to mask and shift the two partial words before running them, byte by byte through the table that counts ON pixels. You can see how this is done for the last partial word in pixCountPixels().

You are welcome to extend pixCountPixels() to take an arbitrary rectangle :-)
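To make the masking concrete, here is a small Python sketch of counting ON pixels inside a rectangle directly on packed 1 bpp rows. It is not Leptonica code: it assumes 32-bit words with the leftmost pixel in the most significant bit, uses `bin(...).count("1")` in place of the byte table, and the names `pack_row` and `count_on_pixels_in_rect` are made up for illustration.

```python
WORD_BITS = 32
FULL = 0xFFFFFFFF


def pack_row(bits):
    """Pack a sequence of 0/1 pixels into 32-bit words, leftmost pixel in the MSB."""
    words = [0] * ((len(bits) + WORD_BITS - 1) // WORD_BITS)
    for i, bit in enumerate(bits):
        if bit:
            words[i // WORD_BITS] |= 1 << (WORD_BITS - 1 - i % WORD_BITS)
    return words


def popcount(word):
    return bin(word).count("1")


def count_on_pixels_in_rect(rows, x, y, w, h):
    """Count ON pixels in the rectangle (x, y, w, h) without cropping first."""
    if w <= 0 or h <= 0:
        return 0
    first_word, first_bit = divmod(x, WORD_BITS)
    last_word, last_bit = divmod(x + w - 1, WORD_BITS)
    head_mask = FULL >> first_bit                             # drop pixels left of x
    tail_mask = (FULL << (WORD_BITS - 1 - last_bit)) & FULL   # drop pixels right of x+w-1
    total = 0
    for row in rows[y:y + h]:
        if first_word == last_word:
            # The rectangle lies inside a single word: apply both masks at once.
            total += popcount(row[first_word] & head_mask & tail_mask)
            continue
        total += popcount(row[first_word] & head_mask)        # partial first word
        for word in row[first_word + 1:last_word]:            # complete words
            total += popcount(word)
        total += popcount(row[last_word] & tail_mask)         # partial last word
    return total
```

The single-word branch is the special case mentioned above, where the first partial word is the only one that contains any pixels of the rectangle.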

egorpugin commented 3 years ago

I'm also thinking about using an 8 bpp b/w image to speed up such calculations. The question is how much this will increase overall memory consumption. Do we really need 1 bpp in Tesseract?

DanBloomberg commented 3 years ago

All this pixel counting is for 1 bpp. With 8 bpp it is much simpler to do most calculations efficiently. For example, with 8 bpp you might be making histograms.

stweil commented 3 years ago

@egorpugin, are you sure that pixCountPixels is the bottleneck here? gprof shows that most of the time is spent in 406709317 calls of GridSearch, which calls std::_Hashtable 1355965899 times, so find and insert on the std::unordered_set are the time-critical operations. Obviously that set is not small, so those operations cost a lot of time.

It is possible to optimize the code and use only insert, no find, but that gives only a very small improvement.

egorpugin commented 3 years ago

On windows I see that pixCountPixels is the slowest part. See https://github.com/tesseract-ocr/tesseract/issues/3369#issuecomment-809326001

DanBloomberg commented 3 years ago

Make sure that you are calling pixCountPixelsInRect() with tab8 defined as the 4th arg. Otherwise, each time pixCountPixels() is called, it has to make the 256-entry tab8.
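The cost in question can be illustrated with a short Python sketch (not Leptonica's implementation; `make_tab8` and `count_on_pixels` are hypothetical names): the 256-entry table maps each byte value to its number of set bits, and a caller in a hot loop builds it once and passes it in.

```python
def make_tab8():
    """Build the 256-entry table: tab8[b] is the number of set bits in byte b."""
    return [bin(b).count("1") for b in range(256)]


def count_on_pixels(words, tab8=None):
    """Count set bits in a sequence of 32-bit words using a byte lookup table."""
    if tab8 is None:
        # Fallback: the table is rebuilt on every call, which is the
        # overhead to avoid when this runs millions of times.
        tab8 = make_tab8()
    total = 0
    for word in words:
        total += (tab8[word & 0xFF] + tab8[(word >> 8) & 0xFF]
                  + tab8[(word >> 16) & 0xFF] + tab8[(word >> 24) & 0xFF])
    return total
```

In a loop over many rectangles, one would call `tab8 = make_tab8()` once up front and then `count_on_pixels(words, tab8)` each time.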

amitdo commented 3 years ago

https://bentkus.eu/ocr_while_loop.png

With the code from #3418, processing finishes in less than half a second when Sauvola binarization is used.