Open wix-andriusb opened 3 years ago
That GIF file is special: it includes 125 images. How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?
This is a gif animation.
Convert it to static images and give them to Tesseract as input.
Other issues where OCR never finishes: #2196, #2288.
Convert it to static images [...]
The static images work fine. Nevertheless handling of animated GIF images has to be well defined, see my question above.
OK, thanks for the advice. I will handle this on my side: check for a GIF, slice it into frames, and analyze those. I tried other GIFs and saw them finish, so I assumed this one would work too.
Can you maybe tell me what tool you used to create the static images?
How should Tesseract handle animated GIF images? Create OCR for all images, or only for the first one, or refuse to process such files?
My answer is 'Create OCR for all images'
Can you maybe tell me what tool you used to create the static images?
I used convert FROM.gif TO.png.
My answer is 'Create OCR for all images'
Could be configurable through arguments with the default being do OCR for all images
It should be something like cat 1.txt 2.txt 3.txt ...
When we pass multiple images, they all must be processed. Same for multipage images (.gif, .tiff) if such format is enabled in leptonica.
So the handling of animated GIF should be similar to multipage TIFF (which either processes all pages or a selected page as far as I remember).
Maybe in a first step throwing an "unimplemented" error is easier. I am not sure how Leptonica supports animated GIF.
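The semantics proposed above (OCR every frame by default, optionally a single selected page, output concatenated like cat 1.txt 2.txt 3.txt) can be sketched as follows. This is only an illustration of the proposal, not Tesseract code: ocr_frames, recognize and page are hypothetical names, and recognize stands in for whatever OCR call is actually used.

```python
from typing import Callable, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def ocr_frames(
    frames: Sequence[Frame],
    recognize: Callable[[Frame], str],  # hypothetical OCR callback
    page: Optional[int] = None,         # hypothetical 0-based page selector
) -> str:
    """Return OCR text for all frames, or for one selected frame."""
    if page is not None:
        if not 0 <= page < len(frames):
            raise IndexError(f"page {page} out of range for {len(frames)} frames")
        return recognize(frames[page])
    # Default: process every frame and concatenate the results in order.
    return "".join(recognize(frame) for frame in frames)
```

With this shape, an animated GIF and a multipage TIFF would be handled by the same code path once both are available as a sequence of frames.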
The static images work fine.
I was mistaken. Not all static images work fine. The first one which looks empty ~does not terminate~ requires more than 4 minutes.
We depend on Leptonica for image IO. Can it handle gif animation? @DanBloomberg
What we need from Leptonica:
This way we can treat it like we treat multi-page tiff.
This image is the offender: a blank page with a specific color.
The first one which looks empty does not terminate.
So Leptonica probably only sees the first image and returns it as pix.
Please attach the first image.
I've recorded the first N GB of debug logs from the infinite loop.
Smooothing part at:Bounding box=(-1888,1064)->(-1884,1067)
Smooothing part at:Bounding box=(-1886,1066)->(-1882,1069)
Smooothing part at:Bounding box=(-1884,1067)->(-1879,1071)
Smooothing part at:Bounding box=(-1874,1074)->(-1870,1077)
Smooothing part at:Bounding box=(-1825,1070)->(-1814,1080)
Smooothing part at:Bounding box=(-1788,1071)->(-1784,1074)
Is it tess specific thing or a bug? negative numbers in bbox
I have now run the latest Tesseract production code on the original animated GIF image. The image is processed, and Tesseract returns a "result" for the first included image. This takes 4:26 minutes, so it finishes, but that is rather long for an image which looks empty to me but obviously includes lots of small colour variations (otherwise the PNG file would be much smaller).
@wix-andriusb, how long did you wait for "never finished"? Depending on your machine, it might take at least 4 minutes, but maybe also 20 minutes. Of course this can nevertheless be considered as a bug.
Is it tess specific thing or a bug? negative numbers in bbox
The original image is 1080 x 1920, so those box coordinates look definitely strange, not only because they are negative, but also because the absolute x values exceed the image width.
We can try to cut the image to, let's say, 50x50 and check it.
Is it possible to implement faster pixel counting?
Locally (macOS) I have installed 4.1.1 with brew. I'm running it and I'm past 8 minutes now.
4.00 on a Linux server was running for hours before it got killed.
For a reduced 50x100 image:
Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)
Smooothing part at:Bounding box=(-100,1)->(0,5)
Smooothing part at:Bounding box=(-100,6)->(0,10)
Smooothing part at:Bounding box=(-100,11)->(0,15)
Smooothing part at:Bounding box=(-100,16)->(0,20)
Smooothing part at:Bounding box=(-100,21)->(0,25)
Smooothing part at:Bounding box=(-100,26)->(0,30)
Smooothing part at:Bounding box=(-100,31)->(0,35)
Smooothing part at:Bounding box=(-100,36)->(0,40)
Smooothing part at:Bounding box=(-100,41)->(0,45)
Smooothing part at:Bounding box=(-100,46)->(0,50)
Full log of that loop - 1.txt
Referring to Amit's comment: I attempted to implement writing of GIF animations about 4 years ago, but failed. I left questions for the GIF inventor/maintainer, but he did not engage. So I implemented writing of WebP animations instead.
Never tried reading animated gif into a pixa.
And if someone shows me how to tell if a gif file is an animated gif, I'll use it in the gif reader to skip ("not supported") reading. I believe that would mostly solve this issue.
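One way to tell whether a GIF file holds more than one image is to walk its block structure and count the Image Descriptor blocks (introducer byte 0x2C). The sketch below does that on raw bytes using only the block layout from the GIF specification; it is not Leptonica code. The names count_gif_frames and is_animated_gif are made up for this example, and a truncated file will simply raise IndexError. Strictly speaking, more than one image does not prove the file is meant as an animation, but it is enough to decide whether to skip or reject the file.

```python
def _skip_sub_blocks(data: bytes, pos: int) -> int:
    """Skip a chain of data sub-blocks; return the position after the 0 terminator."""
    while True:
        size = data[pos]
        pos += 1
        if size == 0:
            return pos
        pos += size


def count_gif_frames(data: bytes) -> int:
    """Count Image Descriptor blocks in a GIF byte string."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Logical Screen Descriptor: 7 bytes, the packed flags are its 5th byte.
    packed = data[6 + 4]
    pos = 6 + 7
    if packed & 0x80:  # global color table present
        pos += 3 * (2 << (packed & 0x07))
    frames = 0
    while pos < len(data):
        block = data[pos]
        pos += 1
        if block == 0x3B:  # trailer
            break
        if block == 0x2C:  # image descriptor
            frames += 1
            packed = data[pos + 8]
            pos += 9
            if packed & 0x80:  # local color table present
                pos += 3 * (2 << (packed & 0x07))
            pos += 1  # LZW minimum code size
            pos = _skip_sub_blocks(data, pos)
        elif block == 0x21:  # extension: label byte, then sub-blocks
            pos += 1
            pos = _skip_sub_blocks(data, pos)
        else:
            raise ValueError(f"unexpected block type 0x{block:02X}")
    return frames


def is_animated_gif(data: bytes) -> bool:
    return count_gif_frames(data) > 1
```

Usage would be along the lines of is_animated_gif(open(path, "rb").read()) before handing the file to the OCR step.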
I don't think there is much demand for advanced OCR support for animated GIF files. That's a very special, rare need. Obviously the first image in an animated GIF is already read and processed with the current code. Processing all images in a file can be done with a simple external conversion.
So the animated GIF issue has very low priority for me.
The huge time which is required to process an image without visible content is more important for me, as I expect that "normal" scans with text can suffer from extended processing time, too. And OCR processing time has high priority.
Funny, I tried to optimize the hot path using pixCountPixelsInRect instead of pixCountPixels. I thought it wouldn't create a new pix, but it does exactly the same as the commented code on the left side.
Both pixRasterop() and pixCountPixels() are optimized, so using them together -- first cropping the rectangle with rasterop and then counting the ON pixels -- is very efficient.
But is it possible to count pixels directly on the original pix?
Yes, of course, but it would be a bit complicated to do it efficiently. The 1 bpp image has 32 pixels in each word. Each raster line in general would have a partial word at the beginning, a series of complete words, and a partial word at the end. And the first partial word might be the only one that has any pixels, so you have to worry about that case as well. You would need to mask and shift the two partial words before running them, byte by byte through the table that counts ON pixels. You can see how this is done for the last partial word in pixCountPixels().
You are welcome to extend pixCountPixels() to take an arbitrary rectangle :-)
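The partial-word handling described above can be illustrated with a small model. This is a sketch in plain Python, not Leptonica code: the image is assumed to be a list of rows, each row a list of 32-bit words with the leftmost pixel in the most significant bit, and the rectangle is assumed to lie fully inside the image. The names make_tab8, count_word and count_on_pixels_in_rect are made up for this example. Because only a count is needed, masking the partial words is sufficient here and no shift is required.

```python
def make_tab8():
    """256-entry table: number of ON bits in each byte value."""
    return [bin(i).count("1") for i in range(256)]


def count_word(word, tab8):
    """Count ON bits in one 32-bit word, byte by byte through tab8."""
    return (tab8[word & 0xFF] + tab8[(word >> 8) & 0xFF]
            + tab8[(word >> 16) & 0xFF] + tab8[(word >> 24) & 0xFF])


def count_on_pixels_in_rect(rows, x, y, w, h, tab8=None):
    """Count ON pixels in the rectangle (x, y, w, h) without cropping first."""
    if w <= 0 or h <= 0:
        return 0
    if tab8 is None:
        tab8 = make_tab8()  # callers in a hot loop should pass a prebuilt table
    first_word, first_bit = divmod(x, 32)
    last_word, last_bit = divmod(x + w - 1, 32)
    # Keep pixels first_bit..31 of the first word and 0..last_bit of the last.
    first_mask = 0xFFFFFFFF >> first_bit
    last_mask = (0xFFFFFFFF << (31 - last_bit)) & 0xFFFFFFFF
    total = 0
    for row in rows[y:y + h]:
        if first_word == last_word:
            # The rectangle starts and ends inside the same word.
            total += count_word(row[first_word] & first_mask & last_mask, tab8)
            continue
        total += count_word(row[first_word] & first_mask, tab8)
        for word in row[first_word + 1:last_word]:
            total += count_word(word, tab8)
        total += count_word(row[last_word] & last_mask, tab8)
    return total
```

Building the table once with make_tab8() and passing it to every call avoids rebuilding the 256 entries on each count.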
I'm also thinking about using an 8 bpp black-and-white image to speed up such calculations. The question is how much this would increase overall memory consumption. Do we really need 1 bpp in tess?
All this pixel counting is for 1 bpp. With 8 bpp it is much simpler to do most calculations efficiently. For example, with 8 bpp you might be making histograms.
@egorpugin, are you sure that pixCountPixels is the bottleneck here? gprof shows that most of the time is spent in 406709317 calls of GridSearch, which calls std::_Hashtable 1355965899 times, so find and insert for the std::unordered_set are the time-critical operations. Obviously that set is not small, so those operations cost a lot of time.
It is possible to optimize the code and use only insert, no find, but that gives only a very small improvement.
On Windows I see that pixCountPixels is the slowest part. See https://github.com/tesseract-ocr/tesseract/issues/3369#issuecomment-809326001
Make sure that you are calling pixCountPixelsInRect() with tab8 defined as the 4th arg. Otherwise, each time pixCountPixels() is called, it has to make the 256-entry tab8.
https://bentkus.eu/ocr_while_loop.png
With the code from #3418, processing finishes in less than half a second when Sauvola binarization is used.
Environment
tesseract 4.1.1
reproduced on macosx and linux
Current Behavior:
Running tesseract on the command line on this image https://bentkus.eu/ocr_while_true.gif does not finish after 1 h.
Expected Behavior:
process should finish in 2 minutes
Suggested Fix:
I'll try to build and see why it never stops
upd. (by @egorpugin): test png - https://bentkus.eu/ocr_while_loop.png