Well, if no text is present or found in a document, OCR is necessary (assuming the file is a document with text in it), so the documentation isn't incomplete in that regard. I agree this should be explained better.
text_original is None ... bug?
Yes, and it's annoying. I've got some documents here, and I can select and copy text from them using any PDF viewer, but for some reason, the Python libraries I've tried don't return any text for these. This is why this upgrading mechanism is in there in the first place.
(including deskew, clean_final, force_ocr and remove_background)
As far as I understand the documentation, force ocr and redo ocr are mutually exclusive and two entirely different approaches to converting a document. They are not supposed to work together.
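For reference, OCRmyPDF refuses to combine the two rather than merging their behaviour; a minimal illustration via its Python API (the exact exception class raised depends on the OCRmyPDF version, hence the broad except):

import ocrmypdf

try:
    # Requesting both modes at once is rejected rather than silently combined.
    ocrmypdf.ocr("scan.pdf", "out.pdf", force_ocr=True, redo_ocr=True)
except Exception as exc:  # exact exception type depends on the OCRmyPDF version
    print(f"Rejected as expected: {exc}")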
To solve the second problem, I propose a change in the parser: In case the mode is set to or upgraded to redo, two passes are performed over a document. The first applies the possibly lossy transformations and all the usual options except for --redo-ocr. That option (and only that) is performed in a second pass over the output of the first phase.
I'm also considering just running --force-ocr on the document in case something goes wrong, see https://github.com/jonaswinkler/paperless-ng/issues/210#issuecomment-752050459. This will catch almost all oddities with PDF documents.
That's not a bad idea and often fixes problematic files. The downside is that, if you have a mixed digital/scanned document, it will lose information from the digital part by converting it to an image.
I'm just moving the discussion of many related issues into one new issue.
Well, if no text is present or found in a document, OCR is necessary (assuming the file is a document with text in it), so the documentation isn't incomplete in that regard. I agree this should be explained better.
We might have a misunderstanding here. (Please pardon my ignorance if this is all wrong.)
get_text_from_pdf() obtains whatever OCR or digitally overlaid text is available in a consumed PDF before feeding it to OCRmyPDF.
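As a minimal sketch of what such an extraction step can look like (assuming pdfminer.six; the function name mirrors the parser method only for illustration and is not the actual paperless-ng implementation):

# Illustrative sketch only; paperless-ng's actual get_text_from_pdf() may use
# a different library. Assumes pdfminer.six is installed.
from pdfminer.high_level import extract_text


def get_text_from_pdf(pdf_path):
    """Return the embedded text layer of a PDF, or None if there is none."""
    try:
        text = extract_text(pdf_path)
    except Exception:
        # A broken or encrypted PDF yields no usable text layer.
        return None
    text = text.strip()
    return text or None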
Now there are four possibilities:
Cases 2 and 3 do not upgrade to redo if the mode contains skip, because get_text_from_pdf() is not None.
In case 1, get_text_from_pdf() returns None. In case 4, it returns either None or some short garbage.
Case 4 requires upgrading to redo to get those nasty documents working.
Case 1 does not, though. It just requires a normal OCRmyPDF run, but gets upgraded to redo (since has_text == False).
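In code, the upgrade decision discussed here boils down to something like the following sketch (the function name is illustrative and the 50-character threshold mirrors the parser behaviour described later in this thread; this is not the actual parser source):

# Illustrative sketch of the current upgrade behaviour; not the actual
# RasterisedDocumentParser code.
def choose_mode(configured_mode: str, text_original) -> str:
    """Return the OCRmyPDF mode the parser would effectively use."""
    has_text = text_original is not None and len(text_original) > 50
    if "skip" in configured_mode and not has_text:
        # Case 4 (missing or garbage OCR layer) needs this upgrade, but
        # case 1 (no text at all) lands here too, although a normal
        # OCRmyPDF run would suffice for it.
        return "redo"
    return configured_mode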
(including deskew, clean_final, force_ocr and remove_background)
As far as I understand the documentation, force ocr and redo ocr are mutually exclusive and two entirely different approaches to converting a document. They are not supposed to work together.
There probably are not many users specifying custom PAPERLESS_OCR_USER_ARGS, but NG should still validate that the ones that are provided work. At the very least it should exclude arguments that are incompatible with a parse (i.e. remove the four above-mentioned ones when performing a redo).
The problem is mitigated when upgrading to force, but the user still may set PAPERLESS_OCR_MODE to redo.
If you don't have any objections, I will add some argument validation to RasterisedDocumentParser.parse() that deals with this scenario.
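A possible shape for that validation, as a rough sketch (the option names and the structure of the args dict are assumptions based on this discussion, not the actual parser code):

# Rough sketch of the proposed validation; not the actual
# RasterisedDocumentParser.parse() implementation.
LOSSY_ARGS = ("deskew", "clean_final", "force_ocr", "remove_background")


def sanitize_user_args(user_args: dict, mode: str) -> dict:
    """Drop user-supplied OCRmyPDF options that conflict with --redo-ocr."""
    if mode != "redo":
        return user_args
    return {k: v for k, v in user_args.items() if k not in LOSSY_ARGS}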
That's a correct analysis of the basic cases.
Sometimes you have case 2+4, a digital document we'd rather preserve if possible with some nastiness appended. I see a lot of PDFs where there's a report created by a Word-to-PDF export, followed by scans of material from a third party that may or may not have OCR.
ocrmypdf's redo cannot fix all cases of bad OCR and sometimes you need to use force. For example, a document with a corrupt Unicode mapping would return mojibake, and redoing OCR won't usually help because the problem is baked into the font.
The current documentation about PAPERLESS_OCR_MODE states that
This is true for most documents, but the current tesseract parser upgrades to redo under certain conditions: PAPERLESS_OCR_MODE is skip or skip_archive, and text_original is None (... bug?) or the contained OCR is deemed incorrect (len(text_original) <= 50 ...).
First of all, this should be documented somewhere other than the source code. I'll happily do this if you want me to do it.
In addition, this behaviour can cause problems with the used tesseract wrapper OCRmyPDF. redo is not currently compatible with transformations OCRmyPDF considers lossy (including deskew, clean_final, force_ocr and remove_background). When supplying one or more of these options in PAPERLESS_OCR_USER_ARGS, the parser breaks and doesn't accept these documents at all. In NG, these options must be explicitly set by the user, so we should consider it safe to do lossy transformations if the user asks for them. This is even less of a problem since the original ingests are stored unconditionally. Upstream issue opened at https://github.com/jbarlow83/OCRmyPDF/issues/708.
To solve the second problem, I propose a change in the parser: In case the mode is set to or upgraded to redo, two passes are performed over a document. The first applies the possibly lossy transformations and all the usual options except for --redo-ocr. That option (and only that) is performed in a second pass over the output of the first phase.
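A rough sketch of what such a two-pass run could look like via OCRmyPDF's Python API (the use of skip_text in the first pass and the intermediate-file handling are assumptions, not the actual parser code):

# Illustrative two-pass sketch of the proposal above; not the actual
# RasterisedDocumentParser code. Assumes ocrmypdf's Python API and str paths.
import ocrmypdf


def two_pass_redo(input_file: str, output_file: str, user_args: dict) -> None:
    """Pass 1: lossy transformations, existing text left alone. Pass 2: --redo-ocr only."""
    intermediate = output_file + ".pass1.pdf"

    # Pass 1: apply the (possibly lossy) user-requested options; skip_text=True
    # is assumed here so pages that already carry text are not OCRed yet.
    ocrmypdf.ocr(input_file, intermediate, skip_text=True, **user_args)

    # Pass 2: only --redo-ocr, run over the output of the first pass.
    ocrmypdf.ocr(intermediate, output_file, redo_ocr=True)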