Closed: mlissner closed this issue 6 years ago.
Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.
Wow, this is really extreme!
Well, I tested it and it takes less than 5 minutes...
Tesseract (the official command line tool) does not accept pdf as input, so how did you convert the pdf to a format that Tesseract accepts?
Here is what I did:
convert gov.uscourts.ctd.18812.88.0.pdf gov.png
This command will create 5 'gov-n.png' images.
First page:
tesseract gov-0.png gov-0
time: 1 minute and 5 seconds
One minute per page is not extraordinarily long (although improvements that make it faster are of course welcome). My worst cases are currently double pages from a historic newspaper, which take around ten minutes.
Thanks for looking at this! We converted using ghostscript to multi-page tiff:
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 -o destination path
One minute/page is still pretty darned slow, but we'd welcome that at this point!
You could use gs to split the pdf into images and then ocr each separately and concatenate the result.
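A minimal sketch of that approach (the pnggray device, 300 dpi, and filenames here are only illustrative):
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -o page-%d.png gov.pdf
for f in page-*.png; do tesseract "$f" "${f%.png}"; done   # OCR each page to page-N.txt
cat page-*.txt > gov.txt   # concatenate the per-page results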
Sure, but that's not the point...and anyway, it's not at all clear that the slowness is because it's a multipage tiff. I suspect if you ran this on each individual page of the tiff you'd have the same slowness.
To get accurate results, you will need to preprocess the images too to get rid of the background speckles.
You could try scantailor or imagemagick.
As a test, you can also try Vietocr GUI, and compare results with the command line output.
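As one minimal example of that kind of cleanup with ImageMagick (the filters and threshold here are only illustrative and would need tuning per document):
convert gov-0.png -despeckle -despeckle -threshold 60% gov-0-clean.png   # remove small specks, then binarize
tesseract gov-0-clean.png gov-0-clean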
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 -o gov2.tiff gov.pdf
Your command creates a 730 MB tiff file, while my command creates five PNG files of 200-300 kB each.
Yeah, we saw this in testing, but went with TIFFs because they support multi-page images, which makes our OCR pipeline easier. In testing, we saw that the OCR for PDFs was no slower using large TIFFs than it was using PNGs because the process seems to be CPU bound no matter what.
If you use 300dpi PNGs do you get the slow performance I experienced with the 300dpi TIFFs? That's probably a better test, right?
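For example, something like this should give 300 dpi grayscale PNGs to compare against the TIFF numbers (pnggray and the output names are just one option):
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300x300 -o gov-%d.png gov.pdf
time tesseract gov-1.png gov-1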
Image properties: width 35.417 in, height 45.834 in, at 72 x 72 DPI.
This is equivalent to width 8.5 in, height 11 in, at 300 x 300 DPI (both work out to 2550 x 3300 pixels).
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r72 -o gov.tiff gov.pdf
OR
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -o gov.tiff gov.pdf
This command creates a 42 MB tiff file. The size in pixels of each page is the same as with my PNGs.
It takes Tesseract 4 minutes and 29 seconds to read this tiff.
@Shreeshrii commented:
To get accurate results, you will need to preprocess the images too to get rid of the background speckles.
I'm guessing that it will run faster too.
BTW, here is what Tesseract outputs in the console:
time tesseract gov.tiff gov
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Detected 1875 diacritics
Page 2
Detected 1338 diacritics
Page 3
Detected 1885 diacritics
Page 4
Detected 658 diacritics
Page 5
Detected 213 diacritics
real 4m29.118s
user 4m28.972s
sys 0m0.152s
It 'thinks' the speckles are diacritics...
Thanks for looking at this @amitdo.
This command creates a 42 MB tiff file. The size in pixels of each page is the same as with my PNGs.
But these aren't 300x300, which is apparently what provides the best OCR quality.[1] The point of this issue is that at 300x300, this takes seven hours to do five pages.
It 'thinks' the speckles are diacritics...
Yeah...that's an issue too. Running a despeckling filter first would help in this case, but we do OCR on millions of PDFs and we only need to despeckle the worst of them. For the rest, I imagine it would reduce quality (not to mention slow down the pipeline).
The point here is that Tesseract takes seven hours for a speckled document at the recommended DPI.
[1]: Some references:
This PDF file is just a bag of images. This is very common and was probably produced by a photocopier or sheetfed scanner. Some fax machines make these too. It is entirely black and white. If you know you are working with black and white images, you can save a ton of space by using appropriate compression. This command renders 100% equivalent images for 2.3MB.
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf
That said, best practice for known 'bag of images' PDF is not to render anything. It is to extract the images, undisturbed. If necessary, adjust their header so that their resolution (e.g. 300 dpi) agrees with what the PDF was claiming. In an ideal world they would always be consistent already, but programmers screw this up all the time. That's the thing you feed to Tesseract (assuming you don't want to do any additional cleaning or something.) This workflow is kind of sophisticated and maybe not easy for everyone. But it makes more sense than potentially rescaling the images by rendering to a different dpi.
I just want to put this here because there are several different references being cited in this bug report about workflow. Please consider this one authoritative.
It does not, however, address the core question about dots, which seems like a legitimate concern. This will be an interesting test document for future development.
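For anyone wanting a concrete starting point for that extract-don't-render workflow, poppler's pdfimages pulls the embedded images out without rescaling them (this is only a sketch, not what anyone here has said they use; the output prefix is arbitrary):
pdfimages gov.uscourts.ctd.18812.88.0.pdf page   # writes page-000.pbm, page-001.pbm, ... depending on the source images
tesseract page-000.pbm page-000
Note that PNM output carries no resolution header at all, so the DPI fix-up described above still has to happen somewhere.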
@mlissner It would have been helpful if you had shared the info about your previous tests for this type of document:
http://stackoverflow.com/questions/39110300/how-to-provide-image-to-tesseract-from-memory
We have seen similar documents taking very long (but still not an hour per page!). Therefore, whether it really is a tesseract issue should be investigated further.
@mlissner In order to increase performance and quality, you have to pre-process the image(s) for Tesseract. For your specific case, use Leptonica (Tesseract already depends on it): count the connected components and, if there are too many, apply your filters. In a real-world application where your documents have specific characteristics, you will not be able to avoid heavy pre-processing for Tesseract in order to achieve reasonable results.
Look at how Tesseract uses Leptonica and connected components, e.g. https://github.com/tesseract-ocr/tesseract/search?utf8=%E2%9C%93&q=pixConnComp
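A rough command-line approximation of that idea, for anyone not ready to write against Leptonica directly (the ImageMagick operator and threshold below are only an illustration and need a reasonably recent ImageMagick build):
convert gov-0.png -threshold 50% -define connected-components:verbose=true -connected-components 8 null: | wc -l   # roughly counts connected components
Pages whose count is far above your corpus's normal range are good candidates for despeckling before OCR.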
@jbreiden
This command renders 100% equivalent images for 2.3MB.
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf
This command will upscale the original images. It will make them more than 4 times larger. This is unnecessary because the DPI of the original images inside the pdf is 300X300, although the pdf itself falsely 'claims' that the DPI for these images is 72X72.
I was just encouraging -sDEVICE=tiffg4 over -sDEVICE=tiffgray for known black and white images. You are right, care should be taken to avoid rescaling, and that's the primary reason image extraction is safer than rendering.
@mlissner You could also look at the preprocessing workflow used by pdf sandwich https://sourceforge.net/projects/pdfsandwich/
Lots of responses here, so let me try to respond to as many as I can.
@amitdo and @jbreiden:
I considered using -sDEVICE=tiffg4 over -sDEVICE=tiffgray, but it's not purely black and white, and like I said, the bigger files don't seem to affect performance. Comparing a gray part of the original PDF rendered both ways, tiffgray is definitely better for this, and since we're doing millions of files, it seems safer to use this approach than to assume all docs are purely black and white (even though it makes big files).
But setting that aside, it seems like using gs is the wrong approach regardless. Seems like the right approach is to extract the images undisturbed. Seems doable, but I'll have to do some research on this. Is it documented anywhere which image formats Tesseract supports natively? There's one question on StackOverflow that seems to address this, but otherwise I don't see a lot of guidance. I'm concerned that if we use the undisturbed images, we'll get weird image formats that Tesseract won't accept.
@jbreiden you also say:
If necessary, adjust their header so that their resolution agrees with what the PDF was claiming.
This feels wrong to me. In my experience, PDFs are a terrible source of ground truth. I'd expect the header information in the images to be much more accurate than whatever a PDF was reporting. You've provided a lot of information here already, but can you explain why we'd prefer the PDF data over the image data?
@vidiecan: I'll look into counting connected components. Seems like a great way to solve this, if it performs well enough. Thanks for this suggestion.
@Shreeshrii: I looked at PDF Sandwich, but didn't see anything useful. Do you know the code well enough to point me towards the image conversion part?
In this buggy broken world, do whatever it takes to get the resolution right. I rescind my recommendation to honor the PDF settings. If you crack open gov.uscourts.ctd.18812.88.0.pdf, you can see that it really does contain black and white images. The telltale is BitsPerComponent 1 and the internal use of CCITTFaxDecode, which only works on black and white.
<<
/Type /XObject
/Filter [/CCITTFaxDecode]
/Length 60 0 R
/Height 3300
/BitsPerComponent 1
/ColorSpace [/DeviceGray]
/DecodeParms [61 0 R]
/Subtype /Image
/Name /Im1
/Width 2550
>>
http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
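As a convenience, assuming poppler-utils is installed, pdfimages -list shows this kind of information without cracking the PDF open by hand; it reports each embedded image's encoding, bit depth, pixel size, and claimed dpi:
pdfimages -list gov.uscourts.ctd.18812.88.0.pdf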
The embedded black and white image inside the PDF is already dithered. Ghostscript is innocent. Normally I prefer to feed Tesseract images that have been messed with as little as possible, but this may just be the exception. Tesseract is not trained on dithered text. Good luck with this one!
If you choose to use morphology to remove the dots and undo the dither, Leptonica is a very strong library for C or C++ programmers. A few morphology operations (erosions and dilations) would hopefully do the trick.
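From the command line, a small morphological close is one way to try that with ImageMagick, which treats white as foreground, so a close is what removes small black specks; in Leptonica, where black is the foreground, the equivalent cleanup would be a small open such as pixOpenBrick. The kernel size here is only a guess and would need tuning:
convert gov-0.png -threshold 50% -morphology Close Diamond:1 gov-0-clean.png   # drop isolated dark specks smaller than the kernel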
@jbreiden, do you know the image formats supported by Tesseract?
Leptonica is responsible for decoding image file formats. The list of supported formats is here. Discard PDF (IFF_LPDF) and PS (IFF_PS) because they are write-only, and discard SPIX because it is Leptonica-specific. This support assumes that Leptonica is built with all imaging dependencies, which are optional. If you are running the Tesseract that ships on Linux distributions such as Debian or Ubuntu, there should be no problems. You might have less support on Cygwin or similar, depending on how Leptonica was built.
https://github.com/DanBloomberg/leptonica/blob/master/src/imageio.h#L92
.--. .-. --- -..- .. -- .- / -.-. . -. - .- ..- .-. .. / -...
I've made a few in-place edits on the bug to clarify the wording. Hopefully makes more sense now.
... . -. -.. / ... .--. .- -.-. . / -- .- .-. .. -. . ...
I deleted my previous message just after you made the edits. I thought that you didn't like my little joke... Clearly, I was wrong!
For the benefit of humankind, here it is again...
@jbreiden
Jeff, your last two messages look cryptic...
If you have been abducted by aliens, try to give us a sign and we will rescue you! :)
.--. .-. --- -..- .. -- .- / -.-. . -. - .- ..- .-. .. / -...
... . -. -.. / ... .--. .- -.-. . / -- .- .-. .. -. . ...
Jeff, we are coming, stay calm!
LOL
It's good to know Morse code, or maybe just to find an online Morse code translator... :)
Even if you use -sDEVICE=tiffgray, you might want to use -sCompression=lzw.
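For example, the same flags as before with the compression option added:
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -sCompression=lzw -r300x300 -o gov.tiff gov.pdf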
You might want to use -sCompression=lzw.
I just did some simple timings on this.
The good: … (measured with time -v).
The bad: …
The hmmm: …
Our bottleneck on our OCR server is CPU, so it's actually preferable for us to generate big files that use less CPU than small files that use more. OTOH, RAM is expensive, so we'll probably be switching this out. Thanks for the suggestion!
For generating many 1-page image files instead of one multi-page tiff file, use -o img-%d.tiff.
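For example, combined with the earlier flags:
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -sCompression=lzw -r300x300 -o img-%d.tiff gov.pdf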
My first patch (dated March 28) in bug https://github.com/tesseract-ocr/tesseract/issues/233 will reduce RAM use with TIFF input. It stops Tesseract from buffering the input file before decompression. The patch should also make the LZW case equal to the non-LZW case with respect to RAM. Note that I haven't tested on this particular example, so I'm saying "should" rather than "does".
Jeff, why are we not committing your patch from March?
The plan was for Ray to commit that patch. However, he has been too busy with the upcoming Tesseract 4.0 and over six months have passed. I think it is okay if someone wants to commit the patch. Please do not commit the second patch, though; that should wait until after the next Leptonica release.
Jeff's patch was applied, so I'm closing this issue. If the issue still exists in the current code, please create a new issue.
Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.
I'm fairly certain that the reason this takes so long is because of the speckling in the document. Other times when I've seen this kind of performance, it's been for similarly speckled documents.
Not sure what you can or should do about it, but since it seems to be a worst case scenario for Tesseract, I thought I'd report it.
This is on the latest version of Tesseract.
gov.uscourts.ctd.18812.88.0.pdf