Enhancement: Report error (or emit warning) early: "ERROR - PriorOcrFoundError: page already has text! - aborting"

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Mozilla Public License 2.0

$ time ocrmypdf infile.pdf out.pdf
Scan: 100%|███████████████████████████████████████████████████████████| 13/13 [03:39<00:00, 16.92s/page]
INFO - Start processing 4 pages concurrently
OCR: 0%| | 0.0/13.0 [00:00<?, ?page/s]
ERROR - PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR)

real 3m41,093s
user 3m39,712s
sys 0m0,931s

ocrmypdf v10 or later is much faster at this early-stage test, fast enough that I don't think it needs the early-out you propose.

In v9, which is what you have, we used to use Ghostscript to do part of the "Scan" phase, but a recent Ghostscript release broke the relevant feature, and the issue is still unfixed there. The quick workaround available was to fall back on slow and complex pure-Python text analysis until that could be optimized and parallelized in v10.

You can use --skip-text to get an output file in the case you describe. The file is still processed and converted; OCR is simply skipped on pages that already have OCR.
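
For anyone scripting this, here is a minimal sketch of calling the command line with --skip-text from Python. The helper names `skip_text_command` and `run_skip_text` are made up for illustration and are not part of ocrmypdf; the only things taken from this thread are the `ocrmypdf` executable and the `--skip-text` flag. It assumes `ocrmypdf` is on your PATH.

```python
import subprocess


def skip_text_command(infile, outfile):
    """Build the argv list for an ocrmypdf run with --skip-text.

    Hypothetical helper: it only assembles the command shown in this thread.
    """
    return ["ocrmypdf", "--skip-text", str(infile), str(outfile)]


def run_skip_text(infile, outfile, runner=subprocess.run):
    """Run ocrmypdf with --skip-text and return its exit status.

    `runner` is injectable so the function can be exercised without
    ocrmypdf installed; by default it is subprocess.run.
    """
    completed = runner(skip_text_command(infile, outfile))
    return completed.returncode
```

Calling `run_skip_text("infile.pdf", "out.pdf")` should behave like typing the command by hand; a zero return value means an output file was written.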

The reason for the current behavior is that it is a part of a consistent behavior contract: we either return success and produce an output file, or exit with an error and produce no output. I appreciate what you're proposing would make sense, but I'd rather not create an exception to the error contract. We also don't presume to know what you want to do with a file that has OCR already - you might want --skip-text (your case), --force-ocr (file has corrupt fonts) or --redo-ocr (OCR done with older software needs to be updated). I suppose if the input device is a terminal we could prompt for an action.
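
To make the terminal-prompt idea concrete, here is a rough sketch of what such a prompt might look like. This is purely hypothetical and not something ocrmypdf does: the function `choose_flag`, the single-letter answers, and the prompt wording are all invented for illustration. Only the three flags come from the discussion above. When the input is not a terminal it returns `None`, which a caller would treat as the current behavior (exit with an error, produce no output).

```python
import sys

# Flags mentioned above, keyed by a hypothetical one-letter answer.
FLAGS = {"s": "--skip-text", "f": "--force-ocr", "r": "--redo-ocr"}


def choose_flag(stdin=sys.stdin, ask=input):
    """Return the flag the user picks, or None to keep the error behavior.

    Only prompts when `stdin` is a terminal, so batch jobs and pipelines
    are never blocked waiting for an answer.
    """
    if not stdin.isatty():
        return None  # non-interactive: fail as today, no output file
    answer = ask("Page already has text: [s]kip, [f]orce, [r]edo, [a]bort? ")
    # Any unrecognized answer (including "a" or empty input) means abort.
    return FLAGS.get(answer.strip().lower()[:1])
```

Keeping the non-interactive path unchanged is the point of the design: the success-or-error contract still holds for scripts, and only a person at a terminal is ever asked.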

Enhancement: Report error (or emit warning) early: "ERROR - PriorOcrFoundError: page already has text! - aborting" #613