jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

Integration with OCRmyPDF #23

Closed jonaswinkler closed 3 years ago

jonaswinkler commented 3 years ago

Add the OCR'ed text as a text layer to the scanned documents so that text can be copied from them.

jonaswinkler commented 3 years ago

See https://github.com/jbarlow83/OCRmyPDF

and https://github.com/mimimi1968/paperless

jonaswinkler commented 3 years ago

The solution should also take already existing documents into account and transform these as well.

jonaswinkler commented 3 years ago

Integration works but needs more tests and more configuration, apparently. As of now, Paperless will offer

I suppose we need an additional interface so that users can specify whatever they want.

Paperless will also store these documents in addition to the untouched originals, both as an exit strategy and in case someone decides to recreate the archived versions with different settings.

totti4ever commented 3 years ago

Moved from #50:

I also have an ocrmypdf job that feeds the sandwiched PDFs straight into the paperless instance I currently use. I'm using the following arguments:

        --output-type pdfa-2 \
        --pdfa-image-compression jpeg \
        --rotate-pages \
        --clean \
        --remove-background \
        --deskew \
        --optimize 3 \
        --skip-text \
        -l "deu" \

In case the ocrmypdf integration in paperless-ng is supposed to be the primary PDF creator for most users, I suggest making the arguments overridable through a config file!
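To make the suggestion concrete, here is a minimal sketch of defaults that a config file can override. It is not how paperless-ng implements this: the names `DEFAULT_OCR_OPTIONS`, `load_overrides` and `build_ocrmypdf_command` are hypothetical, and JSON as the config format is an assumption. The option names are the ocrmypdf flags from the list above, with `--language` as the long form of `-l`.

```python
import json
from typing import Any, Dict, List

# Defaults mirror the argument list from the comment above.
# The dict layout and all names here are hypothetical, not paperless-ng code.
DEFAULT_OCR_OPTIONS: Dict[str, Any] = {
    "output-type": "pdfa-2",
    "pdfa-image-compression": "jpeg",
    "rotate-pages": True,
    "clean": True,
    "remove-background": True,
    "deskew": True,
    "optimize": 3,
    "skip-text": True,
    "language": "deu",
}


def load_overrides(config_text: str) -> Dict[str, Any]:
    """Parse user overrides from JSON text (the JSON format is an assumption)."""
    overrides = json.loads(config_text) if config_text.strip() else {}
    if not isinstance(overrides, dict):
        raise ValueError("override config must be a JSON object")
    return overrides


def build_ocrmypdf_command(
    input_path: str, output_path: str, overrides: Dict[str, Any]
) -> List[str]:
    """Merge overrides into the defaults and return an argv list for ocrmypdf."""
    options = {**DEFAULT_OCR_OPTIONS, **overrides}
    command = ["ocrmypdf"]
    for name, value in options.items():
        if value is True:
            # Boolean switch that is turned on.
            command.append(f"--{name}")
        elif value is False or value is None:
            # Switch turned off by the user: leave it out entirely.
            continue
        else:
            # Option that takes a value.
            command.extend([f"--{name}", str(value)])
    command.extend([input_path, output_path])
    return command
```

A user config such as `{"language": "eng", "clean": false}` would then switch the language and drop `--clean` while keeping every other default. The resulting list can be passed to `subprocess.run(command, check=True)`.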

What you should also keep in mind is to store both the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.

Is there any chance of detecting whether a PDF contains text at all, with an option to skip ocrmypdf in that case so I don't end up with every file twice? Of course, I would still like to be able to throw in files without a text layer. I could imagine those coming from a separate consume service, so that I can disable ocrmypdf for one and enable it for the other.

A lot of thoughts, sorry :-)

jonaswinkler commented 3 years ago

> What you should also keep in mind is to store both the original checksum and the one after ocrmypdf ran; otherwise it wouldn't be possible to recognize duplicates.

Good point, adding checksums for converted documents.
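As an illustration of why both checksums matter, here is a minimal sketch that checks an incoming file against the original and the converted version of every stored document. `StoredDocument`, `checksum` and `find_duplicate` are hypothetical names, the in-memory list stands in for a database, and SHA-256 is an assumption. The thread does not say which hash paperless-ng uses.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def checksum(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class StoredDocument:
    # Hypothetical record; a real application would keep this in its database.
    title: str
    original_checksum: str
    archive_checksum: Optional[str] = None  # set once ocrmypdf has produced a version


def find_duplicate(
    data: bytes, documents: List[StoredDocument]
) -> Optional[StoredDocument]:
    """Return the stored document whose original OR converted file matches."""
    digest = checksum(data)
    for doc in documents:
        if digest in (doc.original_checksum, doc.archive_checksum):
            return doc
    return None
```

With only the original checksum stored, re-consuming a file that ocrmypdf already processed would go undetected, because its bytes differ from the original scan.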

> Is there any chance of detecting whether a PDF contains text at all, with an option to skip ocrmypdf in that case so I don't end up with every file twice?

I understand what you mean. With skip_text, we could skip documents that already have text in them entirely and make no call to ocrmypdf. However, that does not account for cases where only some of the pages have text. Also, ocrmypdf still does a pretty good job at image optimization and at making sure that all archived documents are in the same format, so having ocrmypdf process text-only documents is desirable.
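The partial-text case can be sketched as follows. This assumes the text of each page has already been extracted by some PDF library and is passed in as a list of strings. `TextCoverage`, `classify_text_coverage` and `can_skip_ocr` are hypothetical names, not paperless-ng or ocrmypdf APIs.

```python
from enum import Enum
from typing import Sequence


class TextCoverage(Enum):
    NONE = "none"        # no page has extractable text
    PARTIAL = "partial"  # only some pages have text
    FULL = "full"        # every page has text


def classify_text_coverage(page_texts: Sequence[str]) -> TextCoverage:
    """Classify a document by how many of its pages contain non-blank text."""
    pages_with_text = sum(1 for text in page_texts if text.strip())
    if pages_with_text == 0:
        return TextCoverage.NONE
    if pages_with_text == len(page_texts):
        return TextCoverage.FULL
    return TextCoverage.PARTIAL


def can_skip_ocr(page_texts: Sequence[str]) -> bool:
    """OCR is only safe to skip when every single page already has text."""
    return classify_text_coverage(page_texts) is TextCoverage.FULL
```

A check that only asks "is there any text in this file?" would treat a `PARTIAL` document like a `FULL` one and leave its image-only pages without a text layer.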

totti4ever commented 3 years ago

That is actually a good point - might throw everything into ocrmypdf then to have a common format!

jonaswinkler commented 3 years ago

I've just merged this into dev.

And there's an option to skip storing converted documents if the original already has text.

jonaswinkler commented 3 years ago

It's in the latest release.