I'm on Windows and am using ocrmypdf under WSL with great success.
I've been playing around with the --sidecar option to see if I can "test" whether a PDF has already been OCRed. At the moment I'm doing this in my PowerShell script using pdftotext.exe, similar to this:
foreach ($pdffile in Get-ChildItem -Recurse -Filter *.pdf) {
    $pdftext = & 'C:\Temp\Xpdftools\pdftotext.exe' -q $pdffile.FullName '-'
    if ($null -ne $pdftext) {
        if (($pdftext -join '').Length -lt 20) {
            # Fewer than 20 characters of text: treat the file as needing OCR
        }
    }
}
That works very well - I can then run the ocrmypdf command on the files that match:
wsl ocrmypdf '/mnt/c/OneDrive/scan0014.pdf' '/mnt/c/OneDrive/scan0014.pdf' --output-type pdf
However, it would be nice if I could skip the installation of pdftotext and just use ocrmypdf for everything. Is there any way to do this or would it be a feature request?
You can use ocrmypdf's API to obtain the text, which mainly relies on pdfminer.six. It's in the ocrmypdf.pdfinfo module. There are no command line bindings to access it, however.
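Since there is no command-line binding, one way to avoid pdftotext is a small Python script run inside WSL. The sketch below is an assumption-laden illustration, not ocrmypdf's own API: it calls pdfminer.six directly (the library mentioned above, which should already be present wherever ocrmypdf is installed) through its `extract_text` function. The helper names `needs_ocr` and `extract_pdf_text` are made up for this example, and the 20-character threshold is simply carried over from the PowerShell script in the question. Unlike that script, it counts only non-whitespace characters.

```python
import sys
from pathlib import Path


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Return True when the extracted text is too short to be a real text layer."""
    # Count non-whitespace characters only, so blank pages and form feeds
    # emitted between pages do not count as text.
    return len("".join(text.split())) < min_chars


def extract_pdf_text(path: Path) -> str:
    """Extract the existing text layer with pdfminer.six (no OCR is performed)."""
    # Imported lazily so needs_ocr() stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))


if __name__ == "__main__":
    # Directory to scan; defaults to the current directory.
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for pdf in sorted(root.rglob("*.pdf")):
        if needs_ocr(extract_pdf_text(pdf)):
            # Print the paths that appear to lack a text layer.
            print(pdf)
```

Saved under a hypothetical name such as find_unocred.py, it could be run as `wsl python3 find_unocred.py /mnt/c/OneDrive`, and each printed path then passed to ocrmypdf as before. Encrypted or malformed PDFs may make `extract_text` raise an exception, which this sketch does not handle.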