ragynotes / ragynotes.github.io

🌿 Ragy Notes 📚 https://ragynotes.github.io

How to OCR or re-OCR PDFs or create PDFs #9

Open pskyhx opened 3 years ago

pskyhx commented 3 years ago

Use the OCRmyPDF command-line utility.

Full documentation: https://ocrmypdf.readthedocs.io

Most useful

Batch OCRmyPDF for PDFs that have been partly OCRed, where you only want to process the pages without any existing text. Each file is overwritten in place, because the same name is given as input and output:

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf --skip-text '{}' '{}'

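To run the same kind of job on just one specific file (OCRmyPDF needs both an input and an output path; repeating the name overwrites the file in place):

ocrmypdf --skip-text SomeFile.pdf SomeFile.pdf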
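
If overwriting in place feels risky, here is a minimal sketch that writes the results into a separate output tree instead. It is not from the OCRmyPDF docs: the function name `ocr_tree` and the `OCR_CMD` override variable are hypothetical, introduced only so the loop can be dry-run with `echo`. It assumes file names contain no newlines and that the source directory is passed without a trailing slash.

```shell
# Hypothetical helper (not part of OCRmyPDF): OCR every PDF under SRC
# into a mirrored directory tree under DST, leaving the originals untouched.
# OCR_CMD is an assumed override hook; it defaults to the real ocrmypdf.
ocr_tree() {
    src=$1
    dst=$2
    find "$src" -type f -name '*.pdf' | while IFS= read -r pdf; do
        rel=${pdf#"$src"/}                      # path relative to SRC
        mkdir -p "$dst/$(dirname "$rel")"       # recreate the subdirectory
        # --skip-text: only OCR pages that have no text layer yet.
        # </dev/null keeps the command from consuming the file list on stdin.
        ${OCR_CMD:-ocrmypdf} --skip-text "$pdf" "$dst/$rel" </dev/null \
            || echo "failed: $pdf" >&2
    done
}
```

Something like `OCR_CMD=echo; ocr_tree . ../ocr-output` prints the commands without running them; unset `OCR_CMD` to do the real run. Keep the output directory outside the source tree so `find` does not pick up the results.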

Batch OCRmyPDF for PDFs that have not yet been OCRed at all:

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'

Sidecars (docs)

ocrmypdf --sidecar output.txt input.pdf output.pdf

Note

The sidecar file contains the OCR text found by OCRmyPDF. If the document contains pages that already have text, that text will not appear in the sidecar. If the option --pages is used, only those pages on which OCR was performed will be included in the sidecar. If certain pages were skipped because of options like --skip-big or --tesseract-timeout, those pages will not be in the sidecar.
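
For several files, a sidecar per PDF can be named after the input. A rough sketch, assuming you want `NAME.txt` and `NAME_ocr.pdf` next to each `NAME.pdf`; the function name `ocr_with_sidecar`, the `_ocr` suffix, and the `OCR_CMD` override are all made up for illustration:

```shell
# Hypothetical helper (not part of OCRmyPDF): for each NAME.pdf given,
# write the OCRed PDF to NAME_ocr.pdf and the OCR text to NAME.txt.
# OCR_CMD is an assumed override hook; it defaults to the real ocrmypdf.
ocr_with_sidecar() {
    for pdf in "$@"; do
        base=${pdf%.pdf}                        # strip the .pdf extension
        ${OCR_CMD:-ocrmypdf} --sidecar "$base.txt" "$pdf" "${base}_ocr.pdf" \
            || echo "failed: $pdf" >&2
    done
}
```

Usage would be along the lines of `ocr_with_sidecar scan1.pdf scan2.pdf`. As the note says, each `.txt` only holds text produced by OCR in that run.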

To extract all text from a PDF, whether generated from OCR or otherwise, use a program like Poppler’s pdftotext or pdfgrep.
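
A small sketch of the pdftotext route, assuming Poppler is installed. `extract_text` and `PDFTOTEXT_CMD` are hypothetical names used here for illustration; `-layout` is a pdftotext option that tries to preserve the page layout and can be dropped.

```shell
# Hypothetical helper: dump the full text layer of each NAME.pdf to NAME.txt,
# regardless of whether the text came from OCR or was already in the file.
# PDFTOTEXT_CMD is an assumed override hook; it defaults to Poppler's pdftotext.
extract_text() {
    for pdf in "$@"; do
        ${PDFTOTEXT_CMD:-pdftotext} -layout "$pdf" "${pdf%.pdf}.txt" \
            || echo "failed: $pdf" >&2
    done
}
```

For example, `extract_text *.pdf` in a folder of OCRed files. Note that this would overwrite a sidecar with the same name, so pick a different output name if you want to keep both.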