Possibility to leave PDF 'look' as it was, only adding OCR text

Kors1981 commented 2 years ago

Problem: when I use any image processing option (currently --remove-background), the look of the output PDF changes. Solution: keep the original picture in the PDF and only apply the new OCR text layer on top of it.

jbarlow83 commented 2 years ago

It is possible to do this with a custom plugin that overrides the filter_ocr_image hook and calls the Leptonica remove background function: https://ocrmypdf.readthedocs.io/en/latest/plugins.html#ocrmypdf.pluginspec.filter_ocr_image As explained on that page, there is an "OCR image" (what we show to Tesseract) and possibly a "presentation image" (what the user will see in the PDF, if we are using image preprocessing). We can change the OCR image without affecting the presentation image.
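
A minimal sketch of such a plugin, for illustration only. It implements the filter_ocr_image hook but does not call Leptonica: a simple Pillow-style threshold is swapped in as a crude stand-in for background removal. The file name ocr_only_cleanup.py, the helper whiten_background and the cutoff of 200 are all assumptions, not part of OCRmyPDF.

```python
# ocr_only_cleanup.py (hypothetical file name)
try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so the file can be imported for a dry run without OCRmyPDF.
    def hookimpl(func):
        return func

# Assumed cutoff: gray levels at or above this value are treated as background.
BACKGROUND_THRESHOLD = 200


def whiten_background(image, threshold=BACKGROUND_THRESHOLD):
    """Crude stand-in for background removal (not the Leptonica routine).

    Converts to 8-bit grayscale and pushes light pixels to pure white.
    Image size is unchanged, so text positions stay where they were.
    """
    gray = image.convert("L")
    return gray.point(lambda value: 255 if value >= threshold else value)


@hookimpl
def filter_ocr_image(page, image):
    # Only the OCR image (what Tesseract sees) is modified here;
    # the presentation image in the output PDF is not touched by this hook.
    return whiten_background(image)
```

Assuming the file is saved next to the input, it could be loaded with `ocrmypdf --plugin ocr_only_cleanup.py input.pdf output.pdf`, leaving out --remove-background so the visible page image stays as it was.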

I'm reluctant to make changes that invest more in Leptonica - I'm hoping to replace that dependency with a better image library.

Kors1981 commented 2 years ago

I've placed the example plugin code from the documentation in ocrmypdf_example_plugin.py and got the error shown in the image below. I have pip installed and am using ocrmypdf version 11.7.3. Is there a working example plugin anywhere? Thanks a lot in advance!

jbarlow83 commented 2 years ago

In the screenshot you have "from ocrmpydf", not "from ocrmypdf"... Closing issue due to age.

Possibility to leave PDF 'look' as it was, only adding OCR text #803