Closed jonaswinkler closed 3 years ago
The solution should also take already existing documents into account and transform these as well.
Integration works but needs more tests and more configuration, apparently. As of now, Paperless will offer
I suppose we need an additional interface so that users can specify whatever they want.
Paperless will also store these documents in addition to the untouched originals, both as an exit strategy and in case someone decides to recreate the archived versions with different settings.
moved from #50:
I also have an ocrmypdf job throwing the sandwiched items straight into the paperless instance I currently use. I'm using the following options:
--output-type pdfa-2 \
--pdfa-image-compression jpeg \
--rotate-pages \
--clean \
--remove-background \
--deskew \
--optimize 3 \
--skip-text \
-l "deu" \
In case the ocrmypdf integration into paperless-ng is supposed to be the primary PDF creator for most users, I suggest making the arguments overridable through some config file!
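To make the suggestion concrete, here is a minimal sketch of how defaults could be merged with user-supplied overrides before ocrmypdf is called. The function name `build_ocrmypdf_command`, the `DEFAULT_ARGS` dictionary and the shape of the overrides are assumptions made for illustration; they are not taken from the paperless-ng code base. The flags themselves are the ones quoted above.

```python
# Hypothetical defaults, mirroring some of the flags quoted above.
# These are NOT Paperless' actual defaults.
DEFAULT_ARGS = {
    "output-type": "pdfa-2",
    "rotate-pages": True,
    "deskew": True,
    "clean": True,
    "optimize": 3,
    "skip-text": True,
    "language": "deu",
}


def build_ocrmypdf_command(input_path, output_path, overrides=None):
    """Merge user overrides into the defaults and return an argv list."""
    options = {**DEFAULT_ARGS, **(overrides or {})}
    argv = ["ocrmypdf"]
    for name, value in options.items():
        if value is False or value is None:
            # A flag overridden with False/None is dropped entirely.
            continue
        argv.append(f"--{name}")
        if value is not True:
            # Options that carry a value get it as the next argument.
            argv.append(str(value))
    argv += [input_path, output_path]
    return argv
```

The overrides could come from any source, for example a parsed JSON or INI file; `build_ocrmypdf_command("in.pdf", "out.pdf", {"clean": False, "language": "eng"})` would drop `--clean` and switch the OCR language, and the resulting list can be passed to `subprocess.run`.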
What you should also keep in mind is to store the original checksum and the one after ocrmypdf ran, otherwise it wouldn't be possible to recognize duplicates.
Any chance to recognize if there's text in the PDF at all and an option that if so ocrmypdf is skipped so I don't have all files twice? Of course I still would like to be able to throw in files without text-layer. I could imagine that they would come from another consume service, so I can disable ocrmypdf for one and enable for the other?
A lot of thoughts, sorry :-)
What you should also keep in mind is to store the original checksum and the one after ocrmypdf ran, otherwise it wouldn't be possible to recognize duplicates.
Good point, adding checksums for converted documents.
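As a rough sketch of the idea, duplicate detection could compare the checksum of an incoming file against both the stored original checksum and the stored checksum of the converted version. The field names `checksum` and `archive_checksum`, the helper names and the choice of SHA-256 are assumptions for this example, not a description of what Paperless actually stores.

```python
import hashlib


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(candidate_path, known_documents):
    """Check a new file against original and archived checksums.

    `known_documents` is a list of dicts with the hypothetical keys
    "checksum" (original file) and "archive_checksum" (converted file).
    """
    checksum = file_checksum(candidate_path)
    return any(
        checksum in (doc["checksum"], doc["archive_checksum"])
        for doc in known_documents
    )
```

With both values stored, a file is recognized as a duplicate whether the user feeds in the original scan again or the version that ocrmypdf produced from it.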
Any chance to recognize if there's text in the PDF at all and an option that if so ocrmypdf is skipped so I don't have all files twice?
I understand what you mean. With skip_text, we could skip documents that already have text in them entirely and not make any calls to ocrmypdf. However, that does not account for cases where only some of the pages have text. Also, ocrmypdf still does a pretty good job at image optimization and at making sure that all archived documents are in the same format, so having ocrmypdf process text-only documents is desirable.
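The partial-text case can be illustrated with a small sketch. It assumes the text of each page has already been extracted by some PDF library and is passed in as a list of strings; the function names and the `skip_if_text` switch are hypothetical and only show why a document-level check is not enough.

```python
def classify_text_coverage(page_texts):
    """Return "none", "partial" or "all" depending on which pages have text."""
    pages_with_text = sum(1 for text in page_texts if text and text.strip())
    if pages_with_text == 0:
        return "none"
    if pages_with_text == len(page_texts):
        return "all"
    return "partial"


def should_run_ocr(page_texts, skip_if_text=True):
    """Decide whether a document still needs to go through OCR.

    Only documents where every page already has text are skipped;
    a partially covered document would otherwise lose its scanned pages.
    """
    coverage = classify_text_coverage(page_texts)
    if coverage == "all" and skip_if_text:
        return False
    return True
```

A document with text on page one and a bare scan on page two is classified as "partial" and still sent to OCR, which is the case a simple "has any text" check would miss.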
That is actually a good point - might throw everything into ocrmypdf then to have a common format!
I've just merged this into dev.
And there's an option to skip storing converted documents if the original already has text.
It's in the latest release.
Add the ocr'ed text as a text layer to the scanned documents so that text can be copied from them.