ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[BUG] ghostscript fails due to small resolution value #1102

Open neurolabs opened 1 year ago

neurolabs commented 1 year ago

Describe the bug

When calling ocrmypdf 14.2.0 on the example file, ghostscript gets called with the resolution parameter set to -r1.209464x1.209464, which leads to the error "Unrecoverable error: rangecheck in setscreen". If I call ghostscript manually with a higher resolution setting (e.g. 100x100), ghostscript succeeds.

To Reproduce

docker run --rm -i ocrmypdf:v14.2.0 -v1 -k - - <blank.pdf

Example file

blank.pdf

Expected behavior

ocrmypdf should not call ghostscript with resolution parameters that make ghostscript fail.

System

neurolabs commented 1 year ago

Excuse me for asking. Do you have any idea yet whether this should be fixed in the codebase, or whether it's a won't-fix in your opinion?

Some more background: I discovered this issue while feeding a real-world PDF to https://github.com/paperless-ngx/paperless-ngx, and from my point of view, tackling this issue in OCRmyPDF makes the most sense.

jbarlow83 commented 1 year ago

It can and should be fixed in ocrmypdf, but I'm short on time.

This is a superficially easy fix. It's not hard to force a lower limit on resolution.

It's more difficult to find out why the resolution comes out low for that PDF, if our calculation of resolution is wrong, if the PDF is malformed, or if there are cases where resolution is legitimately low and keeping it low is the right decision.
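For illustration only, the "force a lower limit" idea might look roughly like the sketch below. It is not OCRmyPDF's actual code: Resolution, clamp_resolution, ghostscript_resolution_arg and the ASSUMED_MIN_DPI value are all hypothetical names and numbers made up for this example. The floor of 100 is taken from the report above only because 100x100 is known to work; the real threshold at which ghostscript starts failing is not stated in this thread. The sketch also does nothing to answer why the computed resolution is so low in the first place.

```python
from typing import NamedTuple


class Resolution(NamedTuple):
    """Horizontal and vertical resolution in dots per inch (hypothetical type)."""
    x: float
    y: float


# Assumed floor, chosen only for illustration. The report says 100x100 works
# and about 1.2x1.2 fails; the true minimum is not known from this thread.
ASSUMED_MIN_DPI = 100.0


def clamp_resolution(res: Resolution, min_dpi: float = ASSUMED_MIN_DPI) -> Resolution:
    """Raise each axis to at least min_dpi, leaving larger values untouched.

    Each axis is clamped independently, so the aspect ratio changes if only
    one axis is below the floor.
    """
    return Resolution(max(res.x, min_dpi), max(res.y, min_dpi))


def ghostscript_resolution_arg(res: Resolution) -> str:
    """Format a resolution in the -rXxY shape seen in the failing call."""
    return f"-r{res.x:f}x{res.y:f}"


if __name__ == "__main__":
    computed = Resolution(1.209464, 1.209464)  # value from the bug report
    print(ghostscript_resolution_arg(computed))
    print(ghostscript_resolution_arg(clamp_resolution(computed)))
```

Whether silently raising the resolution is the right behavior, as opposed to warning or skipping the page, depends on the open questions above about legitimately low-resolution pages.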

You're welcome to take a stab at it.

neurolabs commented 1 year ago

Thanks for the clarification. If the moons align, I might poke at it, but I'm also short on time.