ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Enhancement: Report error (or emit warning) early: "ERROR - PriorOcrFoundError: page already has text! - aborting" #613

Open klartext opened 3 years ago

klartext commented 3 years ago

It took more than 3 minutes for a PDF before ocrmypdf reported that the PDF already contains OCR information. So ocrmypdf first scanned the whole PDF, and only THEN found out that it should not create output.

ocrmypdf infile.pdf out.pdf

There are two things to mention here:

If ocrmypdf prefers NOT to create output when infile.pdf has already been OCR'ed, it should report this early, before the scan, which takes a long time.

The output file is different from the input file (no in-place modification), so I see no reason NOT to create an output file. The only reason would be to save time, and in that case the error should be reported early. For in-place modification it would make sense NOT to overwrite, but that was not the case here.

$ time ocrmypdf infile.pdf out.pdf
Scan: 100%|███████████████████████████████████████████████████████████| 13/13 [03:39<00:00, 16.92s/page]
   INFO - Start processing 4 pages concurrently
OCR:   0%|                                                                 | 0.0/13.0 [00:00<?, ?page/s]
  ERROR - PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR)

real    3m41,093s
user    3m39,712s
sys     0m0,931s

jbarlow83 commented 3 years ago

ocrmypdf v10 or later is much faster at this early-stage test, enough that I don't think it will need the early-out you propose.

In v9, which is what you have, we used to use Ghostscript for part of the "Scan" phase, but they broke the relevant feature in a recent release. The issue is still unfixed in Ghostscript. The quick workaround available was to fall back on slow and complex pure Python text analysis until that could be optimized and parallelized in v10.

You can use --skip-text to get an output file in the case you describe. With that option the file still gets processed and converted; OCR is just not done on pages that have OCR already.
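For anyone scripting this, here is a minimal Python sketch of calling the command line with --skip-text. It only uses the flags quoted in this thread (--skip-text and --force-ocr) and the file names from the report. The helper names `build_ocrmypdf_command` and `run_ocrmypdf` are made up for illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess


def build_ocrmypdf_command(infile, outfile, mode=None):
    """Return the argv list for one ocrmypdf run.

    mode is None (default behaviour, which aborts on pages that already
    have text) or one of the flags quoted in this thread.
    Hypothetical helper, not part of OCRmyPDF.
    """
    allowed = {None, "--skip-text", "--force-ocr"}
    if mode not in allowed:
        raise ValueError(f"unsupported mode: {mode!r}")
    command = ["ocrmypdf"]
    if mode is not None:
        command.append(mode)
    command += [infile, outfile]
    return command


def run_ocrmypdf(infile, outfile, mode=None):
    """Run ocrmypdf and return its exit status (needs ocrmypdf on PATH)."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    completed = subprocess.run(build_ocrmypdf_command(infile, outfile, mode))
    return completed.returncode


if __name__ == "__main__":
    # Only print the command here; call run_ocrmypdf() to actually execute it.
    print(" ".join(build_ocrmypdf_command("infile.pdf", "out.pdf", "--skip-text")))
```

Building the argument list separately from running it keeps the flag choice easy to inspect before a long run starts.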

The reason for the current behavior is that it is a part of a consistent behavior contract: we either return success and produce an output file, or exit with an error and produce no output. I appreciate what you're proposing would make sense, but I'd rather not create an exception to the error contract. We also don't presume to know what you want to do with a file that has OCR already - you might want --skip-text (your case), --force-ocr (file has corrupt fonts) or --redo-ocr (OCR done with older software needs to be updated). I suppose if the input device is a terminal we could prompt for an action.
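The "success with an output file, or error with no output" contract can be illustrated with a small Python sketch. This is not how OCRmyPDF implements it. It is only one common way to get that behavior, assuming the result is first written to a temporary file in the output directory. The function name `produce_output_atomically` and the `write_result` callback are hypothetical.

```python
import os
import tempfile


def produce_output_atomically(output_path, write_result):
    """Call write_result(tmp_path); publish output_path only on success.

    If write_result raises, the temporary file is removed and the
    exception propagates, so no output file is left behind.
    """
    out_dir = os.path.dirname(os.path.abspath(output_path))
    # Create the temporary file next to the final output so that the
    # rename below stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    os.close(fd)
    try:
        write_result(tmp_path)
    except Exception:
        # Error path: clean up and produce no output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Success path: move the finished file into place.
    os.replace(tmp_path, output_path)
```

Under this scheme, a caller that hits something like the "page already has text" condition would raise inside `write_result`, and the output path would never appear.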