Well, if no text is present or found in a document, OCR is necessary (assuming the file is a document with text in it), so the documentation isn't incomplete in that regard. I agree this should be explained better.
text_original is None ... bug?
Yes, and it's annoying. I've got some documents here, and I can select and copy text from them using any PDF viewer, but for some reason, the Python libraries I've tried don't return any text for these. This is why this upgrading mechanism is in there in the first place.
(including deskew, clean_final, force_ocr and remove_background)
As far as I understand the documentation, force ocr and redo ocr are mutually exclusive and two entirely different approaches to converting a document. They are not supposed to work together.
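For reference, OCRmyPDF refuses to combine the two rather than merging their behaviour; a minimal illustration via its Python API (the exact exception class raised depends on the OCRmyPDF version, hence the broad except):

import ocrmypdf

try:
    # Requesting both modes at once is rejected rather than silently combined.
    ocrmypdf.ocr("scan.pdf", "out.pdf", force_ocr=True, redo_ocr=True)
except Exception as exc:  # exact exception type depends on the OCRmyPDF version
    print(f"Rejected as expected: {exc}")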
To solve the second problem, I propose a change in the parser: In case the mode is set to or upgraded to redo, two passes are performed over a document. The first applies the possibly lossy transformations and all the usual options except for --redo-ocr. That option (and only that) is performed in a second pass over the output of the first phase.
I'm also considering just running --force-ocr on the document in case something goes wrong, see https://github.com/jonaswinkler/paperless-ng/issues/210#issuecomment-752050459. This will catch almost all oddities with PDF documents.
That's not a bad idea and often fixes problematic files. The downside is that, if you have a mixed digital/scanned document, it will lose information from the digital part by converting it to an image.
I'm just moving the discussion of many related issues into one new issue.
Well, if no text is present or found in a document, OCR is necessary (assuming the file is a document with text in it), so the documentation isn't incomplete in that regard. I agree this should be explained better.
We might have a misunderstanding here. (Please pardon my ignorance if this is all wrong.)
get_text_from_pdf() obtains whatever OCR or digitally overlaid text is available in a consumed PDF before feeding it to OCRmyPDF.
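As a minimal sketch of what such an extraction step can look like (assuming pdfminer.six; the function name mirrors the parser method only for illustration and is not the actual paperless-ng implementation):

# Illustrative sketch only; paperless-ng's actual get_text_from_pdf() may use
# a different library. Assumes pdfminer.six is installed.
from pdfminer.high_level import extract_text


def get_text_from_pdf(pdf_path):
    """Return the embedded text layer of a PDF, or None if there is none."""
    try:
        text = extract_text(pdf_path)
    except Exception:
        # A broken or encrypted PDF yields no usable text layer.
        return None
    text = text.strip()
    return text or None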
Now there are four possibilities:
Cases 2 and 3 do not upgrade to redo if the mode contains skip, because get_text_from_pdf() is not None.
In case 1, get_text_from_pdf() returns None. In case 4, it returns either None or some short garbage.
Case 4 requires upgrading to redo to get those nasty documents working.
Case 1 does not, though. It just requires a normal OCRmyPDF run, but gets upgraded to redo (since has_text == False).
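In code, the upgrade decision discussed here boils down to something like the following sketch (the function name is illustrative and the 50-character threshold mirrors the parser behaviour described later in this thread; this is not the actual parser source):

# Illustrative sketch of the current upgrade behaviour; not the actual
# RasterisedDocumentParser code.
def choose_mode(configured_mode: str, text_original) -> str:
    """Return the OCRmyPDF mode the parser would effectively use."""
    has_text = text_original is not None and len(text_original) > 50
    if "skip" in configured_mode and not has_text:
        # Case 4 (missing or garbage OCR layer) needs this upgrade, but
        # case 1 (no text at all) lands here too, although a normal
        # OCRmyPDF run would suffice for it.
        return "redo"
    return configured_mode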
(including deskew, clean_final, force_ocr and remove_background)
As far as I understand the documentation, force ocr and redo ocr are mutually exclusive and two entirely different approaches to converting a document. They are not supposed to work together.
There probably are not many users specifying custom PAPERLESS_OCR_USER_ARGS, but NG should still validate that the ones that are provided work. At the very least it should exclude arguments that are incompatible with a parse (i.e. remove the four above-mentioned ones when performing a redo).
The problem is mitigated when upgrading to force, but the user still may set PAPERLESS_OCR_MODE to redo.
If you don't have any objections, I will add some argument validation to RasterisedDocumentParser.parse() that deals with this scenario.
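A possible shape for that validation, as a rough sketch (the option names and the structure of the args dict are assumptions based on this discussion, not the actual parser code):

# Rough sketch of the proposed validation; not the actual
# RasterisedDocumentParser.parse() implementation.
LOSSY_ARGS = ("deskew", "clean_final", "force_ocr", "remove_background")


def sanitize_user_args(user_args: dict, mode: str) -> dict:
    """Drop user-supplied OCRmyPDF options that conflict with --redo-ocr."""
    if mode != "redo":
        return user_args
    return {k: v for k, v in user_args.items() if k not in LOSSY_ARGS}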
That's a correct analysis of the basic cases.
Sometimes you have case 2+4, a digital document we'd rather preserve if possible with some nastiness appended. I see a lot of PDFs where there's a report created by a Word-to-PDF export, followed by scans of material from a third party that may or may not have OCR.
ocrmypdf's redo cannot fix all cases of bad OCR and sometimes you need to use force. For example, a document with a corrupt Unicode mapping would return mojibake, and redoing OCR won't usually help because the problem is baked into the font.
The current documentation about PAPERLESS_OCR_MODE states that
This is true for most documents, but the current tesseract parser upgrades to redo under certain conditions: PAPERLESS_OCR_MODE is skip or skip_archive, and text_original is None (... bug?) or the contained OCR is deemed incorrect (len(text_original) <= 50 ...).
First of all, this should be documented somewhere other than the source code. I'll happily do this if you want me to do it.
In addition, this behaviour can cause problems with the used tesseract wrapper OCRmyPDF. redo is not currently compatible with transformations OCRmyPDF considers lossy (including deskew, clean_final, force_ocr and remove_background). When supplying one or more of these options in PAPERLESS_OCR_USER_ARGS, the parser breaks and doesn't accept these documents at all. In NG, these options must be explicitly set by the user, so we should consider it safe to do lossy transformations if the user asks for them. This is even less of a problem since the original ingests are stored unconditionally. Upstream issue opened at https://github.com/jbarlow83/OCRmyPDF/issues/708.
To solve the second problem, I propose a change in the parser: In case the mode is set to or upgraded to redo, two passes are performed over a document. The first applies the possibly lossy transformations and all the usual options except for --redo-ocr. That option (and only that) is performed in a second pass over the output of the first phase.
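A rough sketch of what such a two-pass run could look like via OCRmyPDF's Python API (the use of skip_text in the first pass and the intermediate-file handling are assumptions, not the actual parser code):

# Illustrative two-pass sketch of the proposal above; not the actual
# RasterisedDocumentParser code. Assumes ocrmypdf's Python API and str paths.
import ocrmypdf


def two_pass_redo(input_file: str, output_file: str, user_args: dict) -> None:
    """Pass 1: lossy transformations, existing text left alone. Pass 2: --redo-ocr only."""
    intermediate = output_file + ".pass1.pdf"

    # Pass 1: apply the (possibly lossy) user-requested options; skip_text=True
    # is assumed here so pages that already carry text are not OCRed yet.
    ocrmypdf.ocr(input_file, intermediate, skip_text=True, **user_args)

    # Pass 2: only --redo-ocr, run over the output of the first pass.
    ocrmypdf.ocr(intermediate, output_file, redo_ocr=True)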