ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Best way to handle PDFs with mixed content? #258

Closed guldil closed 6 years ago

guldil commented 6 years ago

Hi,

I have to OCR PDFs with mixed content, for example: 100 pages with vector text and shapes, then 100 pages that are only images (from a scan). If I force OCR I lose quality from the existing layer, so I decided to script it like this:

Is there a way to do this better? It's a bit slow :( I see that Tesseract uses only one thread on my 4-core VM. Maybe Tesseract 4.0 will do better?

I had to compile everything from source (the Ubuntu 16.04 packages are too old and I got errors on some PDFs).

tesseract --version
tesseract 3.05.01
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

ocrmypdf --version
6.1.5

gs -v
GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.

qpdf --version
qpdf version 8.0.2
Run qpdf --copyright to see copyright and license information.

pdftotext -v
pdftotext version 0.64.0
Copyright 2005-2018 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

Thanks

Guldil

jbarlow83 commented 6 years ago

ocrmypdf --skip-text skips pages that already have text and should do what you need on its own. That function already determines on a page-by-page basis whether OCR is required.

If you have a file where text is not being detected properly, try adding PyMuPDF (pip install ocrmypdf[fitz]) to see if that fixes the issue. PyMuPDF is optional and difficult to install on some platforms, but is better at detecting text than OCRmyPDF's fallback algorithm.

I'd like to see the file if possible.

guldil commented 6 years ago

I forgot to give you one piece of information: all the pages have the page number as text, so --skip-text is not working (pages that contain only a scanned image are not OCRed because there is a little number at the bottom...). These PDFs are assembled from multiple sources and the creator added page numbering :(

My pdftotext step is only there to calculate the length of the text and decide whether or not to apply --force-ocr.

I can't give you a sample, but I'll try to create one.
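
For reference, a rough sketch of the kind of wrapper described above (the character threshold, option list, and file names here are illustrative assumptions, not the actual script): it measures the extractable text with pdftotext and only passes --force-ocr when the file has next to no text layer.

#!/usr/bin/env python3
# Sketch: decide between --skip-text and --force-ocr based on how much text
# pdftotext can already extract from the input PDF.
import subprocess
import sys

MIN_CHARS = 100  # assumed threshold; tune for your documents


def extracted_text_length(pdf_path):
    # pdftotext writes the extracted text to stdout when the output file is "-"
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, check=True)
    return len(result.stdout.decode("utf-8", errors="replace").strip())


def ocr(pdf_path, out_path):
    mode = "--force-ocr" if extracted_text_length(pdf_path) < MIN_CHARS else "--skip-text"
    subprocess.run(["ocrmypdf", mode, "-l", "fra+eng", pdf_path, out_path],
                   check=True)


if __name__ == "__main__":
    ocr(sys.argv[1], sys.argv[2])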

jbarlow83 commented 6 years ago

I suggest customizing pdfinfo.py, in the function _page_has_text(), if using PyMuPDF. You could do the length check there.

Without PyMuPDF, ocrmypdf only knows how to look for text showing operators ("probably text") but can't actually retrieve the text.

guldil commented 6 years ago

I tried this:

import fitz  # PyMuPDF

def _page_has_text(infile, pageno):
    # Treat a page as "having text" only if stripping whitespace leaves
    # something and the raw extracted text is at least 10 characters long.
    doc = fitz.Document(infile)
    text = doc.getPageText(pageno)
    if text.strip() != '' and len(text) >= 10:
        return True
    return False

then

ocrmypdf --skip-text --rotate-pages -l fra+eng --deskew --clean --tesseract-timeout 3000 --skip-big 50 input.pdf output-pdfa.pdf

It's working :)

Maybe you could add an option for this? Something like --skip-text-minimum-length.

jbarlow83 commented 6 years ago

Maybe? I have a feeling the next user with a similar problem will need a slightly more complex test for some reason, so the right thing to do is expose this function for the user to customize in a script. But that means I have to make import ocrmypdf work, which, as described in other open bugs, doesn't work and isn't easy to resolve.

jbarlow83 commented 6 years ago

#85 discusses APIs

guldil commented 6 years ago

Thanks a lot it's working :)

jbarlow83 commented 6 years ago

Further to this - if a page contains a small amount of visible text, an implementation like --skip-text-minimum-length will duplicate text in the output file when the amount of text is below minimum length. For example, if each scanned page has a digitally inserted watermark for the page number, then the output content stream will have the digital page number and the OCR'ed representation of the same number. This might cause problems for some viewers especially if the OCR happens to not match the printed text.

The solution may be to create a raster image of the page with all of the text removed.
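
One possible way to prototype that idea (not how OCRmyPDF itself implements it): strip every text object from each page's content stream with pikepdf, save the stripped copy, and rasterize it with an external tool such as Ghostscript. The function name and file names below are placeholders.

# Sketch only: remove all text objects (BT ... ET) from each page's content
# stream so the page can be rasterized without its digital text.
# Assumes a recent pikepdf.
import pikepdf


def strip_text_objects(pdf_path, out_path):
    with pikepdf.open(pdf_path) as pdf:
        for page in pdf.pages:
            kept, in_text_object = [], False
            for inst in pikepdf.parse_content_stream(page):
                # Inline images come back as a separate instruction type
                is_op = isinstance(inst, pikepdf.ContentStreamInstruction)
                op = str(inst.operator) if is_op else ""
                if op == "BT":            # begin text object
                    in_text_object = True
                elif op == "ET":          # end text object
                    in_text_object = False
                elif not in_text_object:  # keep everything outside text objects
                    kept.append(inst)
            page.obj.Contents = pdf.make_stream(
                pikepdf.unparse_content_stream(kept))
        pdf.save(out_path)

# strip_text_objects("input.pdf", "no-text.pdf")
# then rasterize, e.g.: gs -sDEVICE=png16m -r300 -o page-%03d.png no-text.pdf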

MislavSag commented 6 years ago

Hi,

My problem is similar to the problem in the opening post. I have some PDF files that are completely searchable, some that are not searchable at all, and some that are only partially searchable. If I apply --skip-text, some PDF files won't be OCRed even though it would be better if they were, and vice versa.

Is there any algorithm I can apply to a PDF to estimate some kind of score (percentage) for how "searchable" the file is?

jbarlow83 commented 6 years ago

I'm working on improving this area at the moment and should have something out soon (as in days, not weeks) that will look for additional content to OCR without removing existing OCR.

The quick fix is to use --force-ocr which just rasterizes and OCRs everything, but that's often not desirable because of the quality loss, accuracy loss, and file size increase.

There is no standard algorithm to estimate searchability. A simple solution would be to write a little parser in github.com/pikepdf/pikepdf that checks how many pages have any text objects on them.
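
A minimal sketch of that kind of check, under the assumption that counting pages whose content streams contain a text-showing operator (Tj, TJ, ' or ") is a good enough proxy; the function name and output format are made up for illustration.

# Sketch: rough "searchability" score = fraction of pages whose content
# stream contains at least one text-showing operator. Does not look inside
# Form XObjects, so treat the result as an approximation.
import pikepdf

TEXT_SHOWING_OPERATORS = {"Tj", "TJ", "'", '"'}


def searchability(pdf_path):
    with pikepdf.open(pdf_path) as pdf:
        pages_with_text = 0
        for page in pdf.pages:
            for inst in pikepdf.parse_content_stream(page):
                if (isinstance(inst, pikepdf.ContentStreamInstruction)
                        and str(inst.operator) in TEXT_SHOWING_OPERATORS):
                    pages_with_text += 1
                    break
        return pages_with_text / len(pdf.pages)

# print(f"{searchability('input.pdf'):.0%} of pages contain text objects")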

MislavSag commented 6 years ago

Thanks, I will wait for the update. I hope the Docker container will be updated as well :)

You are doing great work.

a22sc commented 5 years ago

@jbarlow83 Works perfectly. I had to work with some messed-up PPT files containing website screenshots and other images with text, and it was very easy to make the documents searchable with --redo-ocr.

Thank you so much. Your solution came at the right moment :)

thyarles commented 2 months ago

Old but gold.