ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Add custom deskew and page rotation logic before OCR #1161

Open wadeflash12 opened 1 year ago

wadeflash12 commented 1 year ago

What were you trying to do?

Hi! Thanks for the great library. I apologize for the bug tag. I cannot change it.

I have implemented custom logic for deskew (using Wand) and rotation. How can I modify the images before they enter the OCR stage, and have the changed images appear in the output PDF as well? I see a couple of plugin hooks that almost work, but none of them preserves the state of the image both before and after OCR. Maybe I’m wrong.

I’m aware I could just prepare an intermediate PDF with my changes and pass that to OCRmyPDF, but I'm wondering whether there is a more streamlined way you would suggest.

My use case is processing lots of pdfs in a backend server.

Thank you.

Where are you installing from?

PyPI (pip, poetry, pipx, etc.)

What operating system are you working on?

Linux, Windows

Relevant log output

No response

jbarlow83 commented 1 year ago

This will be a bit awkward. The trouble is, the plugin system doesn't actually support replacement of some steps in the pipeline that ought to be pluggable, mainly because I never got around to making those steps pluggable.

What you'll need to do is create a plugin that implements the filter_ocr_image hook. Then run ocrmypdf --deskew with your plugin installed. When --deskew is set, ocrmypdf will update the page images in the output PDF.

ocrmypdf calls its built-in preprocessing (deskew, clean), then calls filter_ocr_image before passing the image to OCR, and filter_page_image before inserting the image into the PDF (provided that preprocessing is selected and the PDF needs to be updated). So the idea is that you run ocrmypdf's deskew anyway, but then replace the deskewed image generated by ocrmypdf with your own image.
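A minimal sketch of what such a plugin file might look like, assuming the hook signature `filter_ocr_image(page, image)` with a PIL image in and a PIL image out. `estimate_skew_angle` and `apply_deskew` are hypothetical helpers standing in for custom logic, not part of OCRmyPDF, and the `ImportError` fallback is only there so the file can be read and exercised without OCRmyPDF installed. As reported later in this thread, this hook by itself changes only what the OCR engine sees, not the page image written to the output PDF.

```python
try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so this sketch can be imported without OCRmyPDF installed.
    def hookimpl(func):
        return func


def estimate_skew_angle(image):
    """Hypothetical placeholder: return the correction angle in degrees."""
    return 0.0


def apply_deskew(image, angle):
    """Rotate a PIL image by `angle` degrees (counter-clockwise, as in PIL).

    The canvas size is kept unchanged; uncovered corners are filled with white.
    """
    if angle == 0:
        return image
    return image.rotate(angle, fillcolor="white")


@hookimpl
def filter_ocr_image(page, image):
    # Called after built-in preprocessing, just before the image goes to OCR.
    return apply_deskew(image, estimate_skew_angle(image))
```

The file would be passed through the `plugins` argument (or `--plugin` on the command line), as in the call shown later in the thread.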

ocrmypdf creates intermediate files named {page number}_rasterize.png, _pp_deskewed.png, _pp_clean.png (some of which may be symlinks if features are disabled). Your filter_ocr_image would update _rasterize.png, _pp_deskewed.png, and _pp_clean.png. This is all a bit of a hack based on how things currently work, and some of it might change in the future.

The right way to go about it would be to create a plugin hook for deskew, and a built-in plugin that installs the current implementation, allowing replacement.

For (cardinal angle) rotation, you would create a custom OCR engine that replaces get_orientation() with your custom rotation determination. I don't recommend looking at the details of how rotation is tracked because it's quite complicated. Makes your head spin.

wadeflash12 commented 1 year ago

Thanks for the reply! Definitely a hack!

I placed my deskew logic inside the function filter_ocr_image, returning the deskewed PIL image, and tried the following call:

ocrmypdf.ocr(input_filename, output_filename, plugins=['/content/example_plugin.py'])

With or without the deskew=True argument, the function filter_ocr_image is still called (verified through console logging).

The resulting page (with or without deskew = True) correctly contains the text layer reflecting deskew but the page layer is skewed as before. So it seems like the plugin doesn't maintain the state after the OCR?

jbarlow83 commented 1 year ago

Are you sure your plugin is being executed? (Maybe have it raise an exception to confirm.)

Other than that, maybe it's better to avoid the hacks and make deskew properly pluggable.

wadeflash12 commented 1 year ago

Yes. With a skewed PDF as input, the plugin executes and produces a PDF with the text layer deskewed and the page/image layer still skewed. If I remove the plugin, the extracted text layer is skewed along with the page/image layer.

So the plugin deskews the input to the OCR as intended, and OCR extracts the text and overlays it. But the deskewed page image is not retained for the output PDF.

wadeflash12 commented 1 year ago

@jbarlow83 I took your advice and avoided this hack entirely and am processing as usual after my rotation/skew logic as a preprocessing step.
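A rough sketch of that preprocess-then-OCR arrangement for a backend service, assuming the custom rotation/deskew step can write an intermediate PDF. The names `ocr_with_preprocessing`, `preprocess`, and `run_ocr` are hypothetical; `default_run_ocr` simply calls `ocrmypdf.ocr` on the intermediate file.

```python
import tempfile
from pathlib import Path


def ocr_with_preprocessing(input_pdf, output_pdf, preprocess, run_ocr):
    """Run a custom preprocessing step, then OCR the intermediate PDF.

    preprocess(src, dst) is a hypothetical callable that writes the
    rotated/deskewed PDF to dst; run_ocr(src, dst) performs the OCR.
    """
    with tempfile.TemporaryDirectory() as tmp:
        intermediate = Path(tmp) / "preprocessed.pdf"
        preprocess(Path(input_pdf), intermediate)
        run_ocr(intermediate, Path(output_pdf))
    # The temporary directory and the intermediate PDF are removed here.
    return Path(output_pdf)


def default_run_ocr(src, dst):
    import ocrmypdf  # imported lazily so the module loads without OCRmyPDF

    ocrmypdf.ocr(src, dst)
```

Passing the two steps in as callables keeps the intermediate file handling in one place, so each job in a batch cleans up after itself even if OCR raises.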

Any advice on reducing file size after OCR? Can I downsample images after OCR text has been generated to lower dpi/quality and overlay text layer on them?

jbarlow83 commented 1 year ago

OCRmyPDF already has an optimizer. First use -O3, then lower JPEG and PNG quality with the --jpeg-quality and --png-quality options. Lossy compression is equivalent to a "smart downsampling" rather than degrading the whole image. I'd try that before downsampling. You can get both quality measures surprisingly low but still readable and usable.
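From the Python API, the same settings might be passed as keyword arguments. This is a sketch only: the helper names are hypothetical, and the quality value of 40 is an arbitrary starting point for experimentation, not a recommendation from this thread.

```python
def build_optimize_options(jpeg_quality=40, png_quality=40):
    """Return keyword arguments for ocrmypdf.ocr() that enable -O3 and
    lower the lossy quality settings. The default of 40 is an assumption."""
    for name, value in (("jpeg_quality", jpeg_quality), ("png_quality", png_quality)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return {
        "optimize": 3,
        "jpeg_quality": jpeg_quality,
        "png_quality": png_quality,
    }


def ocr_and_optimize(input_pdf, output_pdf, **quality):
    import ocrmypdf  # imported lazily so the helper above works without it

    return ocrmypdf.ocr(input_pdf, output_pdf, **build_optimize_options(**quality))
```

Lowering the two values step by step and checking the output visually is one way to find the point where pages stop being readable.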

wadeflash12 commented 1 year ago

Great. Thank you. I actually used the filter_page_image hook to try to adjust image quality, but you seem to have thought of that and included them as parameters. :) Will try the optimize settings.

To confirm, the jpeg_quality and png_quality parameters are only applied after OCR/text extraction, right? So they don't affect the quality of the extracted text.

jbarlow83 commented 1 year ago

To confirm...

Correct