Closed homocomputeris closed 1 year ago
The ratio reported is the amount saved by the optimization step in the processing pipeline, rather than the savings for the file as a whole. This helps explain how much optimization helped, but perhaps it isn't the most intuitive explanation. Sometimes adding OCR (and some settings like oversample or force-ocr) increases the file size. While I don't want to bog down the user with details, perhaps it would be better to explain the whole picture: "The input file was this size, after OCR and PDF/A conversion it grew to this size, after optimization it's this size. Here's your file, return 0."
In most cases, the size of the intermediate file sent to optimization is very similar to that of the original input file. But in your case, Ghostscript, in its wisdom or folly, ran into some issue with a data stream in the input PDF and seems to have discarded it, and that action is responsible for most of the savings. This is a case where I'd very carefully inspect the input and output PDFs to see if anything important is missing.
It doesn't seem to be related to a Ghostscript error. Take this file as an example: https://media.canon-asia.com/shared/live/products/EN/Canon-iR1435-Brochure.pdf
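To make the "whole picture" idea above concrete, here is a minimal sketch of a three-stage size report. The function names `savings` and `size_report`, the message wording, and the sample byte counts are all hypothetical and only illustrate the suggestion; this is not OCRmyPDF's actual code or output format.

```python
def savings(before: int, after: int) -> float:
    """Fraction of bytes saved going from `before` to `after`.

    Negative when the file grew.
    """
    return 1 - after / before


def size_report(input_size: int, intermediate_size: int, output_size: int) -> str:
    """Hypothetical three-stage summary (not OCRmyPDF's real output)."""
    lines = [
        f"Input file: {input_size} bytes",
        # Stage 1: OCR and PDF/A conversion, measured against the input.
        f"After OCR and PDF/A conversion: {intermediate_size} bytes "
        f"(savings vs. input: {savings(input_size, intermediate_size):.1%})",
        # Stage 2: optimization, measured against the intermediate file.
        # This corresponds to the step the reported ratio describes.
        f"After optimization: {output_size} bytes "
        f"(savings vs. intermediate: {savings(intermediate_size, output_size):.1%})",
        # Whole pipeline, measured against the input.
        f"Overall savings vs. input: {savings(input_size, output_size):.1%}",
    ]
    return "\n".join(lines)


# Made-up sizes: the file grows during OCR, then shrinks during optimization.
report = size_report(1000, 1200, 900)
print(report)
```

Reporting each stage against its own baseline keeps the optimization figure meaningful while also showing the number most users probably expect, which is the change relative to the file they supplied.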
ocrmypdf --tesseract-timeout=0 --optimize 2 --skip-text Canon-iR1435-Brochure.pdf opt.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████| 4/4 [00:01<00:00, 3.61page/s]
Start processing 4 pages concurrently
1 skipping all processing on this page
2 skipping all processing on this page
3 skipping all processing on this page
4 skipping all processing on this page
Image processing: 100%|█████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 1183.41page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.49page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Recompressing JPEGs: 100%|███████████████████████████████████████████████████| 7/7 [00:00<00:00, 18.09image/s]
Deflating JPEGs: 100%|███████████████████████████████████████████████████████| 7/7 [00:00<00:00, 94.66image/s]
PNGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.15 savings: 13.2%
Output file is a PDF/A-2B (as expected)
then
ls -l
-rw-r--r--@ 1 user staff 2353187 Feb 25 10:47 Canon-iR1435-Brochure.pdf
-rw-r--r-- 1 user staff 1674421 Feb 25 10:48 opt.pdf
and 2353187/1674421 = 1.41.
If the optimization step saves 13%, how does the file become even smaller (overall ratio 1.41, i.e. savings of about 29%)? Is there further optimization afterwards?
Okay, it does become smaller for some reason. But what happens in this case?
ocrmypdf --tesseract-timeout=0 --optimize 0 --skip-text Canon-iR1435-Brochure.pdf opt.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████| 4/4 [00:01<00:00, 3.58page/s]
Start processing 4 pages concurrently
1 skipping all processing on this page
2 skipping all processing on this page
3 skipping all processing on this page
4 skipping all processing on this page
Image processing: 100%|█████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 1407.96page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.56page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)
then
ls -l
-rw-r--r--@ 1 user staff 2353187 Feb 25 10:47 Canon-iR1435-Brochure.pdf
-rw-r--r-- 1 user staff 1932656 Feb 25 11:02 opt.pdf
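The three file sizes from the two runs above can be combined into a small worked calculation. This sketch assumes that the `--optimize 0` output is a reasonable stand-in for the intermediate file that the optimizer receives in the `--optimize 2` run; that is an assumption made for illustration, not something the logs confirm. The helper names `ratio` and `savings` are hypothetical.

```python
# File sizes in bytes, copied from the two `ls -l` listings above.
INPUT_SIZE = 2353187       # Canon-iR1435-Brochure.pdf
OUT_OPTIMIZE_2 = 1674421   # opt.pdf from the --optimize 2 run
OUT_OPTIMIZE_0 = 1932656   # opt.pdf from the --optimize 0 run


def ratio(before: int, after: int) -> float:
    """Size ratio, read as 'old size divided by new size'."""
    return before / after


def savings(before: int, after: int) -> float:
    """Fraction of the old size that was saved (negative if the file grew)."""
    return 1 - after / before


# Whole pipeline: original input versus final optimized output.
overall_ratio = ratio(INPUT_SIZE, OUT_OPTIMIZE_2)
overall_savings = savings(INPUT_SIZE, OUT_OPTIMIZE_2)

# Conversion alone: original input versus the unoptimized output.
conversion_savings = savings(INPUT_SIZE, OUT_OPTIMIZE_0)

# Optimization alone, ASSUMING the --optimize 0 output approximates
# the intermediate file handed to the optimizer.
step_ratio = ratio(OUT_OPTIMIZE_0, OUT_OPTIMIZE_2)
step_savings = savings(OUT_OPTIMIZE_0, OUT_OPTIMIZE_2)

print(f"overall:    ratio {overall_ratio:.2f}, savings {overall_savings:.1%}")
print(f"conversion: savings {conversion_savings:.1%}")
print(f"optimize:   ratio {step_ratio:.2f}, savings {step_savings:.1%}")
```

Under that assumption, the optimization-only ratio comes out at about 1.15, close to the reported "Optimize ratio: 1.15", while the overall ratio of about 1.41 corresponds to savings of roughly 29%. The remaining reduction would then come from the PDF/A conversion step rather than from the optimizer.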
@homocomputeris Different visual content in the input and output PDFs explains the reduction in size.
But that is the main problem...
[Images: input vs. output]
But in my 2nd example I set --optimize 0. Why does the quality get worse?
On a related note, the file size ratio seems to be inverted. For example, I get Total file size ratio: 0.21 savings: -374.0%, which does not match. Savings of more than 100% are mathematically not possible; it seems to be calculated as $\frac{\text{old size}}{\text{new size}}$ instead of $1 - \frac{\text{new size}}{\text{old size}}$.
Unfortunately, I am not able to check whether this is correct in the latest release, as I am on Fedora and only have access to version 15.4.3 through its sources. Thus, I decided to hijack this issue. If someone confirms that this remains an issue in the latest version, I will open a separate issue.
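As a quick check of the two figures quoted above (ratio 0.21, savings -374.0%), here is a small sketch that converts between them. It assumes the definitions used elsewhere in this thread, ratio = old size / new size and savings = 1 - new size / old size; whether OCRmyPDF computes them exactly this way is not confirmed here, and the function names are hypothetical.

```python
def savings_from_ratio(ratio: float) -> float:
    """Savings implied by ratio = old/new, since savings = 1 - new/old = 1 - 1/ratio."""
    return 1 - 1 / ratio


def ratio_from_savings(savings: float) -> float:
    """Inverse relation: old/new = 1 / (1 - savings)."""
    return 1 / (1 - savings)


# Figures quoted in the comment above.
reported_ratio = 0.21
reported_savings = -3.74  # -374.0% expressed as a fraction

implied_savings = savings_from_ratio(reported_ratio)
implied_ratio = ratio_from_savings(reported_savings)

print(f"ratio {reported_ratio} implies savings of {implied_savings:.1%}")
print(f"savings {reported_savings:.1%} implies ratio {implied_ratio:.2f}")
```

Under these assumed definitions the two numbers agree once rounding of the ratio to two decimals is taken into account: a ratio below 1 and negative savings would both say that the output is several times larger than the input, rather than that the ratio is inverted.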
Describe the bug: ocrmypdf successfully optimizes PDFs (not just this one, but others I've tried too) but shows the wrong ratio and savings.
Or maybe the wording doesn't match the output. I understand the ratio as 'old size divided by new size' and the savings as 'how much storage is saved, in %, compared to the original file'.
To Reproduce
Expected behavior In this particular case above, I'd expect:
System (please complete the following information):
Installation
brew install ocrmypdf