How to OCR or re-OCR PDFs or create PDFs

Use the OCRmyPDF command-line utility.

Full documentation: see the OCRmyPDF documentation.

Installing OCRmyPDF

Most useful

Batch OCRmyPDF for PDFs that are partly OCRed, processing only the pages without any existing text

in place (will not preserve previous version of PDF)
recursively (every PDF in the current directory and all directories below)
in parallel (work faster)
skip pages with existing text (see Common error messages: Page already has text for rationale and other options)

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --skip-text '{}' '{}'

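To run the same kind of job on just one specific file (OCRmyPDF requires both an input and an output path, so the file name is given twice to work in place):

ocrmypdf --skip-text SomeFile.pdf SomeFile.pdf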
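
Because the in-place batch command above overwrites each PDF, it can be worth copying the originals first. The following is a minimal sketch, assuming a POSIX shell; the function name `backup_pdfs` and the `.orig` suffix are made up for illustration and are not part of OCRmyPDF:

```shell
#!/bin/sh
# Hypothetical helper (not part of OCRmyPDF): copy every PDF under a
# directory to NAME.pdf.orig so the pre-OCR version survives an in-place run.
backup_pdfs() {
    # $1: directory to walk (defaults to the current directory)
    find "${1:-.}" -type f -name '*.pdf' -exec sh -c '
        for f do
            # Keep an existing backup from an earlier run untouched.
            [ -e "$f.orig" ] || cp -p -- "$f" "$f.orig"
        done
    ' sh {} +
}
```

Running `backup_pdfs .` before the batch job leaves copies ending in `.pdf.orig`, which the `find . -name '*.pdf'` pattern does not match, so the backups are not OCRed themselves.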

Batch OCRmyPDF for PDFs that have not yet been OCRed at all

in place (will not preserve previous version of PDF)
recursively (every PDF in the current directory and all directories below)
in parallel (work faster)

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'
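
Where GNU parallel is not installed, a similar job can be sketched with `find -print0` and `xargs -0`, which also copes with file names that contain spaces or newlines. This assumes an xargs that supports `-0` and `-P` (GNU and BSD xargs do, but neither option is in POSIX), and `ocr_tree` is a hypothetical name chosen for this example:

```shell
#!/bin/sh
# Hypothetical helper: OCR every PDF under a directory in place, two jobs
# at a time, without GNU parallel.
# Usage: ocr_tree [DIR [ocrmypdf options...]]
ocr_tree() {
    dir="${1:-.}"
    # Drop DIR so that "$@" holds only the extra ocrmypdf options.
    if [ "$#" -gt 0 ]; then shift; fi
    # -print0 / -0 pass file names NUL-separated; -I '{}' runs one ocrmypdf
    # per file, with the same path as both input and output (in place).
    find "$dir" -type f -name '*.pdf' -print0 |
        xargs -0 -P 2 -I '{}' ocrmypdf "$@" '{}' '{}'
}
```

With this sketch, `ocr_tree .` corresponds to the case of PDFs with no OCR at all, and `ocr_tree . --skip-text` to the partly OCRed case.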

Sidecars (docs)

ocrmypdf --sidecar output.txt input.pdf output.pdf
Note

The sidecar file contains the OCR text found by OCRmyPDF. If the document contains pages that already have text, that text will not appear in the sidecar. If the option --pages is used, only those pages on which OCR was performed will be included in the sidecar. If certain pages were skipped because of options like --skip-big or --tesseract-timeout, those pages will not be in the sidecar.

To extract all text from a PDF, whether generated from OCR or otherwise, use a program like Poppler’s pdftotext or pdfgrep.
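
As a rough sketch of the pdftotext route, the hypothetical helper below writes the complete text layer of every PDF under a directory to a `.txt` file beside it. It assumes Poppler's `pdftotext` is on the PATH; the function name `all_text` is invented for this example. Unlike a sidecar, the output includes text that was already in the PDF before OCR:

```shell
#!/bin/sh
# Hypothetical helper: dump the full text layer of each PDF under a
# directory to NAME.txt next to it, using Poppler's pdftotext.
# Note: an existing NAME.txt with the same base name is overwritten.
all_text() {
    # $1: directory to walk (defaults to the current directory)
    find "${1:-.}" -type f -name '*.pdf' -exec sh -c '
        for f do
            # -layout keeps the original physical layout as far as possible.
            pdftotext -layout "$f" "${f%.pdf}.txt"
        done
    ' sh {} +
}
```

To search rather than extract, something like `pdfgrep -n 'some phrase' input.pdf` prints matching lines prefixed with their page numbers.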
