Open ferdiga opened 1 month ago
BTW, running ocrmypdf --force-ocr on the big (corrupted) file reduced the size significantly again, and OCR was available again.
I ran your test file and ended up with an acceptable file size increase of 15.5% (66k -> 76k) instead of the dramatic increase you reported. I have the same dependency versions you do, although it looks like you are using macOS and I am using Linux + Homebrew, which should still be very close.
Would you mind running ocrmypdf -k -v --skip-text input.pdf output.pdf, zipping/encrypting the generated temporary folder, and uploading it here? Alternatively, if you can examine the temporary folder and identify which file "blew up" in file size, that might help me figure out what happened.
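For anyone not comfortable digging through the temporary folder by hand, here is a minimal Python sketch that lists the largest files in a folder so the one that "blew up" stands out. The function name `largest_files` is made up for illustration and is not part of OCRmyPDF; the folder path is whatever OCRmyPDF kept when run with -k.

```python
import sys
from pathlib import Path


def largest_files(folder, top=10):
    """Return (size_in_bytes, path) pairs for the largest files under folder."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    sized = [(p.stat().st_size, p) for p in files]
    # Largest first, so the suspicious file shows up at the top.
    return sorted(sized, key=lambda item: item[0], reverse=True)[:top]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python largest_files.py /path/to/kept/temporary/folder
    for size, path in largest_files(sys.argv[1]):
        print(f"{size / 1_000_000:10.2f} MB  {path}")
```

Pass the kept temporary folder as the only argument; the output is sorted by size, largest first.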
Darwin Ferdi-MacBook-Air.local 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103 arm64
gs --version
10.03.1
Blow up: pdfa.ps (9KB) > pdfa.pdf (32.8MB), whereas pdfa.ps seems too small to include all the information.
The debug log: debug.log
and YES, CLI on Macbook is not trivial.
@ferdiga I obviously don't have access to the file, since it's encrypted. Regardless, may I suggest attempting to load pdfa.ps in aeroplane mode? If it's 9KB, I suspect it contains external assets (I don't know if that's possible for .ps files, but technically it could be a .pdf file with the wrong extension), which I think OCRmyPDF embeds if you use PDF/A.
pdfa.ps just provides a sRGB ICC profile and a little metadata which Ghostscript requires and does not provide for PDF/A conversion. It's not an issue.
I confirmed that the issue is reproducible on macOS but not Linux. It's almost certainly a Ghostscript issue at this point -- some discrepancy between their Linux and macOS versions.
@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself?
Ghostscript on macOS causes a dramatic increase in file size for your input file, while Linux does not, both using 10.03.1, with a command line like:
['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/var/folders/b7/yf4jd29d4qg_nr4cxmh__yrh0000gn/T/ocrmypdf.io.1vjrrij4/pdfa.ps', '/var/folders/b7/yf4jd29d4qg_nr4cxmh__yrh0000gn/T/ocrmypdf.io.1vjrrij4/fix_docinfo.pdf']
Provide pdfa.ps and fix_docinfo.pdf as generated by OCRmyPDF using the -k option.
@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself?
Yes, please
I figured out what was going on. Ghostscript is working correctly under the circumstances - no need to report to Artifex.
The provided PDF does not include all of its fonts, and PDF/A conversion requires all fonts to be provided (PDF/A must be fully self-describing). On macOS, this means ~35 MB of fonts get inserted into the file; on Linux and elsewhere, the fonts that happen to be picked for substitution are smaller. (The macOS fonts don't look any better, fwiw. I imagine it has more to do with bundling large Unicode character sets.)
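One way to check in advance whether a PDF is missing embedded fonts is to look at the "emb" column printed by the pdffonts tool from Poppler. The sketch below is only an illustration: it assumes pdffonts is installed and that its table ends with the columns emb, sub, uni, and a two-number object ID. The helper name `non_embedded_fonts` is hypothetical and not something OCRmyPDF provides.

```python
import shutil
import subprocess
import sys


def non_embedded_fonts(pdffonts_output):
    """Return names of fonts that a pdffonts table reports as not embedded."""
    missing = []
    in_table = False
    for line in pdffonts_output.splitlines():
        if line.startswith("---"):
            in_table = True  # rows start after the dashed separator line
            continue
        if not in_table:
            continue
        tokens = line.split()
        if len(tokens) < 8:
            continue
        # Counting from the right: object ID (2 tokens), uni, sub, emb.
        # This avoids trouble with font types that contain spaces.
        if tokens[-5] == "no":
            missing.append(tokens[0])
    return missing


if __name__ == "__main__" and len(sys.argv) > 1:
    if shutil.which("pdffonts") is None:
        sys.exit("pdffonts (Poppler) was not found on PATH")
    result = subprocess.run(
        ["pdffonts", sys.argv[1]], capture_output=True, text=True, check=True
    )
    for name in non_embedded_fonts(result.stdout):
        print("not embedded:", name)
```

Any font listed as not embedded is a candidate for substitution during PDF/A conversion, which is where the platform-dependent size difference comes from.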
Ghostscript has an option, -dNONATIVEFONTMAP, which causes Ghostscript to use its own font library for font substitution. The result would then be consistent in file size and presentation (it will always substitute the same font), but rendering quality is reduced in cases where the system font would have given an exact rendering.
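To compare the two behaviours on the same input, something like the following sketch could build the Ghostscript command with and without -dNONATIVEFONTMAP. The flags are copied from the command line quoted earlier in this thread, except that this version writes to an output file with -o instead of streaming to stdout. The function `gs_pdfa_command` and its parameters are made up for illustration and are not OCRmyPDF's API.

```python
import subprocess
import sys


def gs_pdfa_command(pdfa_ps, input_pdf, output_pdf, native_fontmap=True):
    """Build a Ghostscript PDF/A conversion command as an argument list."""
    args = [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-dCompatibilityLevel=1.6",
        "-sDEVICE=pdfwrite",
        "-dAutoRotatePages=/None",
        "-sColorConversionStrategy=LeaveColorUnchanged",
        "-dPDFSTOPONERROR",
        "-dAutoFilterColorImages=true",
        "-dAutoFilterGrayImages=true",
        "-dJPEGQ=95",
        "-dPDFA=2",
        "-dPDFACompatibilityPolicy=1",
    ]
    if not native_fontmap:
        # Ignore the platform's font map and use Ghostscript's own fonts.
        args.append("-dNONATIVEFONTMAP")
    # Assumption: write to a file rather than stdout, unlike the quoted command.
    args += ["-o", output_pdf, pdfa_ps, input_pdf]
    return args


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python gs_compare.py pdfa.ps fix_docinfo.pdf out_prefix
    ps, pdf, prefix = sys.argv[1:]
    subprocess.run(gs_pdfa_command(ps, pdf, prefix + "_native.pdf"), check=True)
    subprocess.run(
        gs_pdfa_command(ps, pdf, prefix + "_nonative.pdf", native_fontmap=False),
        check=True,
    )
```

Running it on the pdfa.ps and fix_docinfo.pdf kept by -k should produce two files whose sizes can be compared directly.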
A further complication is that font substitution is something that I'd rather not have OCRmyPDF doing automatically, since it can permanently alter the presentation to an inferior rendering (as it does with the provided file, on both platforms).
I'll have to think about this issue some more but I suspect I will end up making "no font substitution" the default behavior (we abort if substitution happens) because users really should do a visual comparison of input/output, or try to install the missing font rather than rely on substitution, etc. Then I can add an option to switch on font substitution.
Describe the bug
In my case, creating a PDF/A increased the file size by a factor of 500!
IMO I identified the culprit: gs cannot handle mixed portrait and landscape pages well. After separating the portrait and landscape pages into 2 separate files, ocrmypdf performed extremely well and reduced the file size of each file. THIS WAS TRUE FOR ONE SET OF TYPICAL FILES - BUT NOT FOR OTHERS
A solution could be to split each file into single-page files, run ocrmypdf (and hence gs) on each, and put them together again? - DOES NOT SOLVE THE PROBLEM
Steps to reproduce
Files
Here is the JSON representation using qpdf --json <>: Monatsbericht zum 30.06.2023-json.pdf, Monatsbericht zum 30.06.2023-ocr-json.pdf
the log file ocrmypdf.log encrypted original file test.zip
shows the size after ocrmypdf
How did you download and install the software?
Homebrew
OCRmyPDF version
16.4.2
Relevant log output
No response