ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
13.63k stars 997 forks

[Bug]: Large file size increases due to PDF/A font substitution #1369

Open ferdiga opened 1 month ago

ferdiga commented 1 month ago

Describe the bug

In my case, creating the PDF/A increased the file size by a factor of 500!

Steps to reproduce

1. Run ocrmypdf -v --skip-text input.pdf output.pdf
BTW I tried many other parameters; the output was always about the same size.
gs took minutes to create the multi-MB files.

Files

Here are the JSON representations produced with qpdf --json: Monatsbericht zum 30.06.2023-json.pdf, Monatsbericht zum 30.06.2023-ocr-json.pdf

The log file: ocrmypdf.log. The encrypted original file: test.zip

shows the size after ocrmypdf

20240803 100232 ocr_test

How did you download and install the software?

Homebrew

OCRmyPDF version

16.4.2

Relevant log output

No response

ferdiga commented 1 month ago

BTW, running ocrmypdf --force-ocr on the big (corrupted) file reduced the size significantly again, and OCR was still available.

jbarlow83 commented 1 month ago

I ran your test file and ended up with an acceptable file size increase of 15.5% (66k -> 76k) instead of the dramatic increase you reported. I have the same dependency versions you do, although it looks like you are using macOS and I am using Linux + Homebrew, which should still be very close.

Would you mind running ocrmypdf -k -v --skip-text input.pdf output.pdf, zipping/encrypting the generated temporary folder, and uploading it here? Alternatively, if you can examine the temporary folder and identify which file "blew up" in file size, that might help me figure out what happened.
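For anyone following along, here is a minimal sketch of one way to spot the file that "blew up" inside the folder kept by -k. It only walks a directory and sorts by size; the function name `largest_files` is made up for this example, and the folder path is whatever OCRmyPDF reports in its verbose output on your machine.

```python
from pathlib import Path


def largest_files(folder, top=5):
    """Return (size_in_bytes, relative_path) pairs, largest file first.

    `folder` is assumed to be the temporary working folder that
    OCRmyPDF leaves behind when run with -k.
    """
    root = Path(folder)
    sized = [
        (p.stat().st_size, str(p.relative_to(root)))
        for p in root.rglob("*")
        if p.is_file()
    ]
    # Tuples sort by size first, so reverse order puts the biggest on top.
    return sorted(sized, reverse=True)[:top]


def format_report(entries):
    """Render the pairs as aligned text lines with sizes in MB."""
    return [f"{size / 1_000_000:10.2f} MB  {name}" for size, name in entries]
```

Calling `print("\n".join(format_report(largest_files("/path/to/kept/temp/folder"))))` should list the biggest intermediate files first, which makes a sudden jump between two processing steps easy to see.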

ferdiga commented 1 month ago

Darwin Ferdi-MacBook-Air.local 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103 arm64

gs --version
10.03.1

Blow-up: pdfa.ps (9KB) > pdfa.pdf (32.8MB), although pdfa.ps seems too small to include all the information.

20240805 111738 ocrmypdf io 1vjrrij4

The debug log: debug.log

and YES, CLI on Macbook is not trivial.

gamer191 commented 1 month ago

@ferdiga I obviously don't have access to the file, since it's encrypted. Regardless, may I suggest attempting to load pdfa.ps in aeroplane mode? If it's 9KB, I suspect it contains external assets (I don't know if that's possible for .ps files, but technically it could be a .pdf file with the wrong extension), which I think OCRMyPDF embeds if you use PDF/A

jbarlow83 commented 1 month ago

pdfa.ps just provides a sRGB ICC profile and a little metadata which Ghostscript requires and does not provide for PDF/A conversion. It's not an issue.

jbarlow83 commented 1 month ago

I confirmed that the issue is reproducible on macOS but not Linux. It's almost certainly a Ghostscript issue at this point -- some discrepancy between their Linux and macOS versions.

jbarlow83 commented 1 month ago

@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself?

Ghostscript on macOS causes a dramatic increase in file size, while Linux does not, both using 10.03.1, for your input file, with a command line like:

['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/var/folders/b7/yf4jd29d4qg_nr4cxmh__yrh0000gn/T/ocrmypdf.io.1vjrrij4/pdfa.ps', '/var/folders/b7/yf4jd29d4qg_nr4cxmh__yrh0000gn/T/ocrmypdf.io.1vjrrij4/fix_docinfo.pdf']

Provide pdfa.ps and fix_docinfo.pdf as generated by OCRmyPDF using the -k option.

ferdiga commented 1 month ago

@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself?

Yes, please

jbarlow83 commented 1 month ago

I figured out what was going on. Ghostscript is working correctly under the circumstances - no need to report to Artifex.

The provided PDF does not include all of its fonts, and PDF/A conversion requires all fonts to be provided (PDF/A must be fully self-describing). On macOS, this means ~35 MB of fonts get inserted into the file; on Linux and elsewhere, the fonts that happen to be picked for substitution are smaller. (The macOS fonts don't look any better, fwiw. I imagine it has more to do with bundling large Unicode character sets.)

Ghostscript has an option, -dNONATIVEFONTMAP, which causes Ghostscript to use its own font library for font substitution. The result would then be consistent in file size and presentation (it will always substitute the same font), but rendering quality will be reduced in cases where the system has the exact font available.
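As a rough illustration only (this is not OCRmyPDF's actual code), the sketch below builds a trimmed-down version of the Ghostscript command line quoted earlier in this thread and optionally adds -dNONATIVEFONTMAP. The function name `build_gs_pdfa_args` and its `native_fontmap` parameter are hypothetical, and unlike the quoted command it writes to a named output file instead of stdout.

```python
def build_gs_pdfa_args(pdfa_ps, input_pdf, output_pdf, native_fontmap=True):
    """Build an argument list for a Ghostscript PDF/A-2 conversion.

    Only a subset of the flags from the command quoted in this thread
    is kept. When native_fontmap is False, -dNONATIVEFONTMAP is added
    so substitution draws on Ghostscript's own fonts, not system fonts.
    """
    args = [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=LeaveColorUnchanged",
        "-dPDFA=2",
        "-dPDFACompatibilityPolicy=1",
    ]
    if not native_fontmap:
        args.append("-dNONATIVEFONTMAP")
    # The PostScript prologue (ICC profile, metadata) goes before the PDF.
    args += ["-o", output_pdf, pdfa_ps, input_pdf]
    return args
```

Passing the result to `subprocess.run(args, check=True)` with `native_fontmap=False` on both macOS and Linux would be one way to check whether the output sizes converge, assuming gs is on the PATH and the pdfa.ps and fix_docinfo.pdf files from a -k run are used as inputs.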

A further complication is that font substitution is something that I'd rather not have OCRmyPDF doing automatically, since it can permanently alter the presentation to an inferior rendering (as it does with the provided file, on both platforms).

I'll have to think about this issue some more but I suspect I will end up making "no font substitution" the default behavior (we abort if substitution happens) because users really should do a visual comparison of input/output, or try to install the missing font rather than rely on substitution, etc. Then I can add an option to switch on font substitution.