The sidecar file contains the OCR text found by OCRmyPDF. If the document contains pages that already have text, that text will not appear in the sidecar. If the option --pages is used, only those pages on which OCR was performed will be included in the sidecar. If certain pages were skipped because of options like --skip-big or --tesseract-timeout, those pages will not be in the sidecar.
To extract all text from a PDF, whether generated from OCR or otherwise, use a program like Poppler’s pdftotext or pdfgrep.
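As a rough illustration of the difference between the two approaches, the sketch below builds one command that asks OCRmyPDF for a sidecar and one that asks pdftotext for all text in the finished PDF. The helper names and file names are hypothetical, and the commands are only built as argument lists here; `run_if_available` runs one only if the tool is installed.

```python
import shutil
import subprocess


def sidecar_command(input_pdf, output_pdf, sidecar_txt):
    """Build an ocrmypdf command that also writes the OCR text to a sidecar file."""
    # --sidecar takes the path of the text file to write.
    return ["ocrmypdf", "--sidecar", str(sidecar_txt), str(input_pdf), str(output_pdf)]


def full_text_command(pdf, text_out):
    """Build a pdftotext command that extracts all text, whether from OCR or not."""
    return ["pdftotext", str(pdf), str(text_out)]


def run_if_available(argv):
    """Run argv only if its executable is on PATH; return True if it was run."""
    if shutil.which(argv[0]) is None:
        return False
    subprocess.run(argv, check=True)
    return True
```

For example, `run_if_available(sidecar_command("scan.pdf", "scan_ocr.pdf", "scan.txt"))` would produce only the text OCRmyPDF recognized, while `run_if_available(full_text_command("scan_ocr.pdf", "scan_all.txt"))` would also include text the pages already had.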
Use the OCRmyPDF command line utility (see its full documentation).
Most useful
Batch OCRmyPDF for PDFs that have been partly OCRed, when you only want to work on the pages without any existing text.
To run the same kind of job on just one specific file:
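The original commands for these jobs are not preserved here. One way they might look is sketched below, assuming OCRmyPDF's `--skip-text` option (which leaves pages that already contain text untouched) and assuming each file is overwritten in place. The function names and the folder layout are hypothetical.

```python
from pathlib import Path


def skip_text_command(pdf):
    """Build an ocrmypdf command that OCRs only the pages without existing text."""
    # Assumption: input and output are the same path, so the file is replaced in place.
    return ["ocrmypdf", "--skip-text", str(pdf), str(pdf)]


def find_pdfs(folder):
    """Return every .pdf file under folder, recursively, in sorted order."""
    return sorted(Path(folder).rglob("*.pdf"))


def batch_commands(pdfs):
    """Build one skip-text command per PDF path."""
    return [skip_text_command(pdf) for pdf in pdfs]
```

A batch run could then pass each entry of `batch_commands(find_pdfs("scans"))` to `subprocess.run(..., check=True)`, and the single-file case is just `skip_text_command("scans/report.pdf")`. Writing to a separate output path instead is the safer choice if the originals are not backed up.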
Batch OCRmyPDF for PDFs that have not yet been OCRed at all
Sidecars (docs)