ocrmypdf --skip-text skips pages that already have text and should do what you need on its own. That option already determines, on a page-by-page basis, whether OCR is required.
If you have a file where text is not being detected properly, try adding PyMuPDF (pip install ocrmypdf[fitz]) to see if that fixes the issue. PyMuPDF is optional and difficult to install on some platforms, but it is better at detecting text than OCRmyPDF's fallback algorithm.
I'd like to see the file if possible.
I forgot to give you one piece of information: all the pages have a page number as text, so --skip-text is not working (pages containing only a scanned image get no OCR because there is a little number at the bottom...). These PDFs are created from multiple sources and the creator added page numbering :(
My pdftotext step is only there to calculate the length of the text and decide whether or not to apply --force-ocr.
I can't give you a sample, but I'll try to create one.
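For illustration, here is a minimal sketch of that kind of length check, assuming pdftotext from Poppler is on the PATH and calling ocrmypdf as a subprocess; the 10-character threshold, file names, and helper names are placeholders, not the exact script used in this thread.

import subprocess

def extracted_text_length(pdf_path):
    # "pdftotext input.pdf -" writes the extracted text to stdout
    result = subprocess.run(
        ['pdftotext', pdf_path, '-'],
        capture_output=True, text=True, check=True,
    )
    return len(result.stdout.strip())

def build_ocr_command(pdf_path, output_path, threshold=10):
    cmd = ['ocrmypdf']
    if extracted_text_length(pdf_path) < threshold:
        cmd.append('--force-ocr')   # almost no text: rasterize and OCR everything
    else:
        cmd.append('--skip-text')   # text already present: leave those pages alone
    cmd += [pdf_path, output_path]
    return cmd

if __name__ == '__main__':
    subprocess.run(build_ocr_command('input.pdf', 'output-pdfa.pdf'), check=True)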
I suggest customizing pdfinfo.py, in the function _page_has_text(), if you are using PyMuPDF. You could do the length check there. Without PyMuPDF, ocrmypdf only knows how to look for text-showing operators ("probably text") but can't actually retrieve the text.
I tried this:
import fitz  # PyMuPDF

def _page_has_text(infile, pageno):
    # Extract the page text with PyMuPDF; treat pages with fewer than
    # 10 characters (e.g. just a printed page number) as having no text.
    doc = fitz.Document(infile)
    text = doc.getPageText(pageno)
    if text.strip() != '' and len(text) >= 10:
        return True
    return False
then
ocrmypdf --skip-text --rotate-pages -l fra+eng --deskew --clean --tesseract-timeout 3000 --skip-big 50 input.pdf output-pdfa.pdf
It's working :)
Maybe you could add an option for this? --skip-text-minimum-length
Maybe? I have a feeling the next user with a similar problem will need a slightly more complex test for some reason, so the right thing to do is expose this function to the user to customize in a script. But that means I have to make import ocrmypdf work, which, as described in other open bugs, doesn't work and isn't easy to resolve.
Thanks a lot it's working :)
Further to this: if a page contains a small amount of visible text, an implementation like --skip-text-minimum-length will duplicate text in the output file when the amount of text is below the minimum length. For example, if each scanned page has a digitally inserted watermark for the page number, then the output content stream will contain both the digital page number and the OCR'ed representation of the same number. This might cause problems for some viewers, especially if the OCR happens to not match the printed text.
The solution may be to create a raster image of the page with all of the text removed.
Hi,
My problem is similar to the problem in the original post. I have some PDF files which are completely searchable, some that are not searchable at all, and some that are partially searchable. If I apply --skip-text, some PDF files won't be OCR'ed even though it would be better if they were, and vice versa.
Is there any algorithm I can apply to a PDF that can estimate some kind of score (percentage) of the file's "searchability"?
I'm working on improving this area at the moment and should have something out soon (as in days, not weeks) that will look for additional content to OCR without removing existing OCR.
The quick fix is to use --force-ocr which just rasterizes and OCRs everything, but that's often not desirable because of the quality loss, accuracy loss, and file size increase.
There is no standard algorithm to estimate searchability. A simple solution would be to write a little parser in github.com/pikepdf/pikepdf that checks how many pages have any text objects on them.
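As a rough sketch of that kind of check (my own placeholder scoring, not an official algorithm): assuming pikepdf is installed and a recent version where parse_content_stream returns instructions with an .operator attribute, a "searchability" score could be the fraction of pages whose content stream contains any text-showing operator. Note this ignores text hidden inside Form XObjects.

import sys
import pikepdf

# PDF operators that actually paint text on the page
TEXT_SHOWING_OPERATORS = {'Tj', 'TJ', "'", '"'}

def searchability_score(pdf_path):
    # Fraction of pages whose content stream contains at least one
    # text-showing operator; 1.0 means every page has some text objects.
    with pikepdf.open(pdf_path) as pdf:
        if len(pdf.pages) == 0:
            return 0.0
        pages_with_text = 0
        for page in pdf.pages:
            operators = {str(inst.operator)
                         for inst in pikepdf.parse_content_stream(page)}
            if operators & TEXT_SHOWING_OPERATORS:
                pages_with_text += 1
        return pages_with_text / len(pdf.pages)

if __name__ == '__main__':
    print(f'{searchability_score(sys.argv[1]):.0%} of pages contain text objects')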
Thanks, I will wait for the update. I hope the Docker container will be updated as well :)
You are doing great work.
@jbarlow83 works perfectly. I had to work with some messed-up PPT files with website screenshots and other images containing text. It was very easy to make the documents searchable with --redo-ocr.
Thank you so much. Your solution came at the right moment :)
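For reference, a minimal invocation of that mode might look like this (file names are placeholders):

ocrmypdf --redo-ocr input.pdf output.pdf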
Old but gold.
Hi,
I have to OCR mixed-content PDFs, for example: 100 pages with vector text and shapes, then 100 pages with only an image (from a scan). If I force OCR, I lose quality from the vector layer, so I decided to script it like this:
Is there a way to do this better, because it's a bit slow :( I see that tesseract uses only one thread on my 4-core VM. Maybe tesseract 4.0 will do better?
I have to compile everything from source (the Ubuntu 16.04 packages are too old and I got errors on some PDFs).
Thanks
Guldil