ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Way to test PDF to see if there is any text? #1043

Closed by spedinfargo 1 year ago

spedinfargo commented 1 year ago

I'm on Windows and am using ocrmypdf under WSL with some great success.

I've been playing around with the --sidecar option to see if I can "test" a PDF for ocr-ness. For now I'm doing that test in my PowerShell script with pdftotext.exe, similar to this:

foreach ($pdffile in Get-ChildItem -Recurse -Filter *.pdf) {
    $pdftext = & 'C:\Temp\Xpdftools\pdftotext.exe' -q $pdffile.FullName -
    # Fewer than 20 characters of extracted text: treat the PDF as not yet OCRed
    if ($null -ne $pdftext -and $pdftext.ToCharArray().Length -lt 20) {
        # ... run ocrmypdf on $pdffile.FullName here ...
    }
}

That works very well - I can then run the ocrmypdf command on the files that match:

wsl ocrmypdf '/mnt/c/OneDrive/scan0014.pdf' '/mnt/c/OneDrive/scan0014.pdf' --output-type pdf

However, it would be nice if I could skip the installation of pdftotext and just use ocrmypdf for everything. Is there any way to do this or would it be a feature request?

jbarlow83 commented 1 year ago

You can use ocrmypdf's API to obtain the text. It's in the ocrmypdf.pdfinfo module, which mainly relies on pdfminer.six. There are no command line bindings to access it, however.
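
For anyone who wants a rough starting point, here is a minimal sketch of the same check in Python. It does not use ocrmypdf.pdfinfo itself, since that module's interface isn't shown in this thread; it calls pdfminer.six (the library mentioned above) directly through its extract_text function. The helper names has_enough_text and pdf_needs_ocr are made up for illustration, and the 20-character threshold is borrowed from the PowerShell script. Unlike that script, this version counts only non-whitespace characters, so form feeds and blank lines don't count as text.

```python
import sys


def has_enough_text(text: str, min_chars: int = 20) -> bool:
    """Return True if text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def pdf_needs_ocr(path: str, min_chars: int = 20) -> bool:
    """Return True if the PDF at path appears to have little or no text layer."""
    # Imported lazily so the threshold helper works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return not has_enough_text(extract_text(path), min_chars)


if __name__ == "__main__":
    # Report one line per PDF path given on the command line.
    for pdf_path in sys.argv[1:]:
        status = "needs OCR" if pdf_needs_ocr(pdf_path) else "has text"
        print(f"{status}: {pdf_path}")
```

Saved under a hypothetical name such as check_text.py, this could be run under WSL in the same Python environment as ocrmypdf, which would remove the need for a separate pdftotext install. Some PDFs (encrypted or malformed ones, for example) may make extract_text raise an exception, so a real script would probably want error handling around that call.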