Closed: mlissner closed this issue 6 years ago.
Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.
Wow, this is really extreme!
Well, I tested it and it takes less than 5 minutes...
Tesseract (the official command line tool) does not accept pdf as input, so how did you convert the pdf to a format that Tesseract accepts?
Here is what I did:
convert gov.uscourts.ctd.18812.88.0.pdf gov.png
This command will create 5 'gov-n.png' images.
First page:
tesseract gov-0.png gov-0
time: 1 minute and 5 seconds
One minute per page is not extraordinarily long (although improvements that make it faster are of course welcome). My worst cases are currently double pages from a historic newspaper, which take around ten minutes.
Thanks for looking at this! We converted using ghostscript to multi-page tiff:
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 -o destination path
One minute/page is still pretty darned slow, but we'd welcome that at this point!
You could use gs to split the pdf into images and then ocr each separately and concatenate the result.
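A minimal sketch of that approach (the pnggray device, 300 dpi, and filenames here are only illustrative):
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -o page-%d.png gov.pdf
for f in page-*.png; do tesseract "$f" "${f%.png}"; done   # OCR each page to page-N.txt
cat page-*.txt > gov.txt   # concatenate the per-page results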
Sure, but that's not the point...and anyway, it's not at all clear that the slowness is because it's a multipage tiff. I suspect if you ran this on each individual page of the tiff you'd have the same slowness.
To get accurate results, you will need to preprocess the images too to get rid of the background speckles.
You could try scantailor or imagemagick.
As a test, you can also try Vietocr GUI, and compare results with the command line output.
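As one minimal example of that kind of cleanup with ImageMagick (the filters and threshold here are only illustrative and would need tuning per document):
convert gov-0.png -despeckle -despeckle -threshold 60% gov-0-clean.png   # remove small specks, then binarize
tesseract gov-0-clean.png gov-0-clean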
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 -o gov2.tiff gov.pdf
Your command creates a 730 MB tiff file, while my command creates five PNG files of 200-300 kB each.
Yeah, we saw this in testing, but went with TIFFs because they support multi-page images, which makes our OCR pipeline easier. In testing, we saw that the OCR for PDFs was no slower using large TIFFs than it was using PNGs because the process seems to be CPU bound no matter what.
If you use 300dpi PNGs do you get the slow performance I experienced with the 300dpi TIFFs? That's probably a better test, right?
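For example, something like this should give 300 dpi grayscale PNGs to compare against the TIFF numbers (pnggray and the output names are just one option):
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300x300 -o gov-%d.png gov.pdf
time tesseract gov-1.png gov-1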
Image properties: width 35.417 in, height 45.834 in, at 72 x 72 DPI.
This is equivalent to width 8.5 in, height 11 in, at 300 x 300 DPI (both work out to 2550 x 3300 pixels).
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r72 -o gov.tiff gov.pdf
OR
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -o gov.tiff gov.pdf
This command creates a 42 MB tiff file. The size in pixels of each page is the same as with my PNGs.
It takes Tesseract 4 minutes and 29 seconds to read this tiff.
@Shreeshrii commented:
To get accurate results, you will need to preprocess the images too to get rid of the background speckles.
I'm guessing that it will run faster too.
BTW, here is what Tesseract outputs in the console:
time tesseract gov.tiff gov
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Detected 1875 diacritics
Page 2
Detected 1338 diacritics
Page 3
Detected 1885 diacritics
Page 4
Detected 658 diacritics
Page 5
Detected 213 diacritics
real 4m29.118s
user 4m28.972s
sys 0m0.152s
It 'thinks' the speckles are diacritics...
Thanks for looking at this @amitdo.
This command creates a 42 MB tiff file. The size in pixels of each page is the same as with my PNGs.
But these aren't 300x300, which is apparently what provides the best OCR quality.[1] The point of this issue is that at 300x300, this takes seven hours to do five pages.
It 'thinks' the speckles are diacritics...
Yeah...that's an issue too. Running a despeckling filter first would help in this case, but we do OCR on millions of PDFs and we only need to despeckle the worst of them. For the rest, I imagine it would reduce quality (not to mention slow down the pipeline).
The point here is that Tesseract takes seven hours for a speckled document at the recommended DPI.
[1]: Some references:
This PDF file is just a bag of images. This is very common and was probably produced by a photocopier or sheetfed scanner. Some fax machines make these too. It is entirely black and white. If you know you are working with black and white images, you can save a ton of space by using appropriate compression. This command renders 100% equivalent images for 2.3MB.
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf
That said, best practice for known 'bag of images' PDF is not to render anything. It is to extract the images, undisturbed. If necessary, adjust their header so that their resolution (e.g. 300 dpi) agrees with what the PDF was claiming. In an ideal world they would always be consistent already, but programmers screw this up all the time. That's the thing you feed to Tesseract (assuming you don't want to do any additional cleaning or something.) This workflow is kind of sophisticated and maybe not easy for everyone. But it makes more sense than potentially rescaling the images by rendering to a different dpi.
I just want to put this here because there are several different references being cited in this bug report about workflow. Please consider this one authoritative.
It does not, however, address the core question about dots, which seems like a legitimate concern. This will be an interesting test document for future development.
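For anyone wanting a concrete starting point for that extract-don't-render workflow, poppler's pdfimages pulls the embedded images out without rescaling them (this is only a sketch, not what anyone here has said they use; the output prefix is arbitrary):
pdfimages gov.uscourts.ctd.18812.88.0.pdf page   # writes page-000.pbm, page-001.pbm, ... depending on the source images
tesseract page-000.pbm page-000
Note that PNM output carries no resolution header at all, so the DPI fix-up described above still has to happen somewhere.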
@mlissner It would have been helpful if you had shared the info about your previous tests for this type of document:
http://stackoverflow.com/questions/39110300/how-to-provide-image-to-tesseract-from-memory
We have seen similar documents taking very long (but still not an hour per page!). Therefore, whether it really is a tesseract issue should be investigated further.
@mlissner In order to increase performance and quality, you have to pre-process the image(s) for Tesseract. For your specific case, use Leptonica (Tesseract already depends on it): count the connected components and, if there are too many, apply your filters. In a real-world application where your documents have specific characteristics, you will not be able to avoid heavy pre-processing for Tesseract in order to achieve reasonable results.
Look at how Tesseract uses Leptonica and connected components, e.g. https://github.com/tesseract-ocr/tesseract/search?utf8=%E2%9C%93&q=pixConnComp
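A rough command-line approximation of that idea, for anyone not ready to write against Leptonica directly (the ImageMagick operator and threshold below are only an illustration and need a reasonably recent ImageMagick build):
convert gov-0.png -threshold 50% -define connected-components:verbose=true -connected-components 8 null: | wc -l   # roughly counts connected components
Pages whose count is far above your corpus's normal range are good candidates for despeckling before OCR.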
@jbreiden
This command renders 100% equivalent images for 2.3MB.
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf
This command will upscale the original images. It will make them more than 4 times larger. This is unnecessary because the DPI of the original images inside the pdf is 300X300, although the pdf itself falsely 'claims' that the DPI for these images is 72X72.
I was just encouraging -sDEVICE=tiffg4 over -sDEVICE=tiffgray for known black and white images. You are right, care should be taken to avoid rescaling, and that's the primary reason image extraction is safer than rendering.
@mlissner You could also look at the preprocessing workflow used by pdf sandwich https://sourceforge.net/projects/pdfsandwich/
Lots of responses here, so let me try to respond to as many as I can.
@amitdo and @jbreiden:
I considered using -sDEVICE=tiffg4 over -sDEVICE=tiffgray, but it's not purely black and white, and like I said, the bigger files don't seem to affect performance. Comparing a gray part of the original PDF rendered both ways, tiffgray is definitely better for this, and since we're doing millions of files, it seems safer to use this approach than to assume all docs are purely black and white (even though it makes big files).
But setting that aside, it seems like using gs is the wrong approach regardless. Seems like the right approach is to extract the images undisturbed. Seems doable, but I'll have to do some research on this. Is it documented anywhere which image formats Tesseract supports natively? There's one question on StackOverflow that seems to address this, but otherwise I don't see a lot of guidance. I'm concerned that if we use the undisturbed images, we'll get weird image formats that Tesseract won't accept.
@jbreiden you also say:
If necessary, adjust their header so that their resolution agrees with what the PDF was claiming.
This feels wrong to me. In my experience, PDFs are a terrible source of ground truth. I'd expect the header information in the images to be much more accurate than whatever a PDF was reporting. You've provided a lot of information here already, but can you explain why we'd prefer the PDF data over the image data?
@vidiecan: I'll look into counting connected components. Seems like a great way to solve this, if it performs well enough. Thanks for this suggestion.
@Shreeshrii: I looked at PDF Sandwich, but didn't see anything useful. Do you know the code well enough to point me towards the image conversion part?
In this buggy broken world, do whatever it takes to get the resolution right. I rescind my recommendation to honor the PDF settings. If you crack open gov.uscourts.ctd.18812.88.0.pdf, you can see that it really does contain black and white images. The telltale is BitsPerComponent 1 and the internal use of CCITTFaxDecode, which only works on black and white.
<<
/Type /XObject
/Filter [/CCITTFaxDecode]
/Length 60 0 R
/Height 3300
/BitsPerComponent 1
/ColorSpace [/DeviceGray]
/DecodeParms [61 0 R]
/Subtype /Image
/Name /Im1
/Width 2550
>>
http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
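As a convenience, assuming poppler-utils is installed, pdfimages -list shows this kind of information without cracking the PDF open by hand; it reports each embedded image's encoding, bit depth, pixel size, and claimed dpi:
pdfimages -list gov.uscourts.ctd.18812.88.0.pdf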
The embedded black and white image inside the PDF is already dithered. Ghostscript is innocent. Normally I prefer to feed Tesseract images that have been messed with as little as possible, but this may just be the exception. Tesseract is not trained on dithered text. Good luck with this one!
If you choose to use morphology to remove the dots and undo the dither, Leptonica is a very strong library for C or C++ programmers. A few morphology operations (erosions and dilations) would hopefully do the trick.
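From the command line, a small morphological close is one way to try that with ImageMagick, which treats white as foreground, so a close is what removes small black specks; in Leptonica, where black is the foreground, the equivalent cleanup would be a small open such as pixOpenBrick. The kernel size here is only a guess and would need tuning:
convert gov-0.png -threshold 50% -morphology Close Diamond:1 gov-0-clean.png   # drop isolated dark specks smaller than the kernel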
@jbreiden, do you know the image formats supported by Tesseract?
Leptonica is responsible for decoding image file formats. The list of supported formats is here. Discard PDF (IFF_LPDF) and PS (IFF_PS) because they are write-only, and discard SPIX because it is Leptonica-specific. This support assumes that Leptonica is built with all imaging dependencies, which are optional. If you are running the Tesseract that ships on Linux distributions such as Debian or Ubuntu, there should be no problems. You might have less support on Cygwin or similar, depending on how Leptonica was built.
https://github.com/DanBloomberg/leptonica/blob/master/src/imageio.h#L92
.--. .-. --- -..- .. -- .- / -.-. . -. - .- ..- .-. .. / -...
I've made a few in-place edits on the bug to clarify the wording. Hopefully makes more sense now.
... . -. -.. / ... .--. .- -.-. . / -- .- .-. .. -. . ...
I deleted my previous message just after you made the edits. I thought that you didn't like my little joke... Clearly, I was wrong!
For the benefit of humankind, here it is again...
@jbreiden
Jeff, your last two messages look cryptic...
If you have been abducted by aliens, try to give us a sign and we will rescue you! :)
.--. .-. --- -..- .. -- .- / -.-. . -. - .- ..- .-. .. / -...
... . -. -.. / ... .--. .- -.-. . / -- .- .-. .. -. . ...
Jeff, we are coming, stay calm!
LOL
It's good to know Morse code, or maybe just to find an online Morse code translator... :)
Even if you use -sDEVICE=tiffgray, you might want to use -sCompression=lzw.
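For example, the same flags as before with the compression option added:
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -sCompression=lzw -r300x300 -o gov.tiff gov.pdf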
You might want to use -sCompression=lzw.
I just did some simple timings on this.
The good: … (measured with time -v).
The bad: …
The hmmm: …
Our bottleneck on our OCR server is CPU, so it's actually preferable for us to generate big files that use less CPU than small files that use more. OTOH, RAM is expensive, so we'll probably be switching this out. Thanks for the suggestion!
For generating many 1-page image files instead of one multi-page tiff file, use -o img-%d.tiff.
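For example, combined with the earlier flags:
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -sCompression=lzw -r300x300 -o img-%d.tiff gov.pdf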
My first patch (dated March 28) in bug https://github.com/tesseract-ocr/tesseract/issues/233 will reduce RAM use with TIFF input. It stops Tesseract from buffering the input file before decompression. The patch should also make the LZW case equal to the non-LZW case with respect to RAM. Note that I haven't tested on this particular example, so I'm saying "should" rather than "does".
Jeff, why are we not committing your patch from March?
The plan was for Ray to commit that patch. However, he has been too busy with the upcoming Tesseract 4.0 and over six months have passed. I think it is okay if someone wants to commit the patch. Please do not commit the second patch, though; that should wait until after the next Leptonica release.
Jeff's patch was applied, so I'm closing this issue. If the issue still exists in the current code, please create a new issue.
Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.
I'm fairly certain that the reason this takes so long is because of the speckling in the document. Other times when I've seen this kind of performance, it's been for similarly speckled documents.
Not sure what you can or should do about it, but since it seems to be a worst case scenario for Tesseract, I thought I'd report it.
This is on the latest version of Tesseract.
gov.uscourts.ctd.18812.88.0.pdf