ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

--redo-ocr not currently compatible with lossy transformations #708

Closed: C0nsultant closed this issue 3 years ago

C0nsultant commented 3 years ago

Describe the scenario

paperless-ng uses OCRmyPDF to perform text recognition. By default, no lossy transformations are applied. Also by default, some documents are parsed using --redo-ocr. I would like to use --deskew since the ADF on my printer generally does not align the documents properly. In paperless-ng, using lossy transformations is not a big concern since the original documents are stored alongside the output of OCRmyPDF. Using both options results in:

--redo-ocr is not currently compatible with --deskew, --clean-final, and --remove-background

To Reproduce

ocrmypdf --deskew --redo-ocr input.pdf output.pdf
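For a wrapper that assembles OCRmyPDF arguments programmatically, the conflict can be caught before the tool is invoked. The following is only a minimal sketch, not OCRmyPDF's own validation code: the helper name `find_redo_ocr_conflicts` is hypothetical, and the list of conflicting flags is taken solely from the error message quoted above.

```python
# Hypothetical pre-flight check for a wrapper script.
# The flag list mirrors the error message quoted in this report;
# it is an assumption that no other options are affected.
LOSSY_FLAGS = ("--deskew", "--clean-final", "--remove-background")


def find_redo_ocr_conflicts(args):
    """Return the flags in args that the error message says clash with --redo-ocr."""
    if "--redo-ocr" not in args:
        return []
    return [flag for flag in LOSSY_FLAGS if flag in args]


if __name__ == "__main__":
    # The command line from this report, split into arguments.
    command = ["ocrmypdf", "--deskew", "--redo-ocr", "input.pdf", "output.pdf"]
    conflicts = find_redo_ocr_conflicts(command)
    if conflicts:
        print("--redo-ocr conflicts with:", ", ".join(conflicts))
```

An empty result means the argument list does not trigger this particular validation error; it does not guarantee that OCRmyPDF will accept the arguments for other reasons.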

Expected behavior

If the current validation error happens purely because the lossy transformation is considered unsafe, an option to explicitly allow it would suffice. If this behavior is caused by something else, let's discuss.

jbarlow83 commented 3 years ago

If possible this should be resolved within paperless-ng. There is no easy way to make those features compatible.

--redo-ocr does "surgery" on an existing PDF's content stream: it removes text that appears to be related to OCR and grafts on a layer of invisible OCR text. --redo-ocr is probably the wrong option for a scanned image PDF.
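One way a calling application could act on this is to switch OCR modes whenever a lossy option is requested. The sketch below is an assumption, not something this thread or paperless-ng prescribes: the helper `resolve_ocr_flags` is hypothetical, and it swaps in OCRmyPDF's `--force-ocr` option, which rasterizes every page and discards any existing text, so it is only reasonable for documents that really are scanned images.

```python
# Hypothetical argument rewriting for a wrapper around OCRmyPDF.
# Assumption: falling back to --force-ocr is acceptable for the caller.
# --force-ocr rasterizes pages, so vector content and existing text are lost.
LOSSY_FLAGS = frozenset({"--deskew", "--clean-final", "--remove-background"})


def resolve_ocr_flags(requested_flags):
    """Replace --redo-ocr with --force-ocr when a lossy option is also requested."""
    flags = list(requested_flags)
    wants_lossy = any(flag in LOSSY_FLAGS for flag in flags)
    if "--redo-ocr" in flags and wants_lossy:
        flags = ["--force-ocr" if flag == "--redo-ocr" else flag for flag in flags]
    return flags


if __name__ == "__main__":
    print(resolve_ocr_flags(["--deskew", "--redo-ocr"]))
    print(resolve_ocr_flags(["--redo-ocr"]))
```

Keeping the rewrite in the caller matches the suggestion to resolve this outside OCRmyPDF, and leaves `--redo-ocr` untouched when no lossy transformation is requested.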

jbarlow83 commented 3 years ago

I'll close the issue now. If you have further related questions feel free to reopen it.