ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[BUG] Wrong optimize ratio and savings #1070

Closed homocomputeris closed 1 year ago

homocomputeris commented 1 year ago

Describe the bug

ocrmypdf successfully optimizes PDFs (not just this one; others I've tried too) but reports the wrong ratio and savings.

Or perhaps the wording doesn't match the output. I understand the ratio as 'old size divided by new size' and the savings as 'how much storage is saved, in %, compared to the original file'.

To Reproduce

user@ntnu ~/Desktop % ll
total 88576
-rw-r--r--@ 1 user  staff  45093889 Feb  2 20:10 sample.pdf

user@ntnu ~/Desktop % ocrmypdf --tesseract-timeout=0 --optimize 2 --skip-text sample.pdf opt.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.81page/s]
Start processing 3 pages concurrently
    1 skipping all processing on this page                                                                    
    2 skipping all processing on this page                                                                    
    3 skipping all processing on this page                                                                    
Image processing: 100%|█████████████████████████████████████████████████| 3.0/3.0 [00:00<00:00, 1225.57page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.10page/s]
GPL Ghostscript 10.0.0 (2022-09-21)
Copyright (C) 2022 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 3.
Page 1
GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Page 2
GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Page 3
GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
GPL Ghostscript 10.00.0: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?
Attempting to write a DeviceN space with an inappropriate alternate,
have you set ColorConversionStrategy ?

The following errors were encountered at least once while processing this file:
    error reading a stream

 This file had errors that were repaired or ignored.

 The file was produced by: 

 >>>> Adobe PDF library 9.90 <<<<

 Please notify the author of the software that produced this

 file that it does not conform to Adobe's published PDF

 specification.

Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Recompressing JPEGs: 100%|█████████████████████████████████████████████████| 13/13 [00:00<00:00, 76.52image/s]
Deflating JPEGs: 100%|████████████████████████████████████████████████████| 13/13 [00:00<00:00, 289.04image/s]
PNGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.11 savings: 10.0%
Output file is a PDF/A-2B (as expected)
ocrmypdf --tesseract-timeout=0 --optimize 2 --skip-text sample.pdf opt.pdf  9.49s user 0.74s system 99% cpu 10.315 total

user@ntnu ~/Desktop % ll
total 94872
-rw-r--r--  1 user  staff   3223084 Feb  2 20:11 opt.pdf
-rw-r--r--@ 1 user  staff  45093889 Feb  2 20:10 sample.pdf

Expected behavior

In the particular case above, I'd expect:

Optimize ratio: 13.99 savings: 92.85%
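
The expected figures follow directly from the two file sizes in the listing above. A minimal sketch of that arithmetic, assuming the definitions given in this report (ratio = old size / new size, savings = 1 - new size / old size); the function names are illustrative and are not taken from OCRmyPDF:

```python
def size_ratio(old_size: int, new_size: int) -> float:
    """Old size divided by new size (greater than 1 means the file shrank)."""
    return old_size / new_size


def size_savings(old_size: int, new_size: int) -> float:
    """Fraction of the original size that was saved."""
    return 1.0 - new_size / old_size


# Sizes in bytes, copied from the `ll` output above.
sample_pdf = 45093889
opt_pdf = 3223084

print(
    f"Optimize ratio: {size_ratio(sample_pdf, opt_pdf):.2f} "
    f"savings: {size_savings(sample_pdf, opt_pdf):.2%}"
)
# Prints: Optimize ratio: 13.99 savings: 92.85%
```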

System

Installation: brew install ocrmypdf

jbarlow83 commented 1 year ago

The ratio reported is the amount saved by the optimization step in the processing pipeline, rather than the savings for the file as a whole. This helps explain how much optimization helped, but perhaps it isn't the most intuitive presentation. Sometimes adding OCR (and some settings like oversample or force-ocr) increases the file size. While I don't want to bog down the user with details, perhaps it would be better to explain the whole picture: "The input file was this size, after OCR and PDF/A conversion it grew to this size, after optimization it's this size. Here's your file, return 0."

In most cases, the size of the intermediate file sent to optimization is very similar to the original input file. But in your case, Ghostscript, in its wisdom or folly, ran into some issue with a data stream in the input PDF and seems to have discarded it, and that action is responsible for most of the savings. This is a case where I'd very carefully inspect the input and output PDFs to see if anything important is missing.

homocomputeris commented 1 year ago

It doesn't seem to be related to the Ghostscript error. Take this file as an example: https://media.canon-asia.com/shared/live/products/EN/Canon-iR1435-Brochure.pdf

ocrmypdf --tesseract-timeout=0 --optimize 2 --skip-text Canon-iR1435-Brochure.pdf opt.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.61page/s]
Start processing 4 pages concurrently
    1 skipping all processing on this page                                                                    
    2 skipping all processing on this page                                                                    
    3 skipping all processing on this page                                                                    
    4 skipping all processing on this page                                                                    
Image processing: 100%|█████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 1183.41page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.49page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Recompressing JPEGs: 100%|███████████████████████████████████████████████████| 7/7 [00:00<00:00, 18.09image/s]
Deflating JPEGs: 100%|███████████████████████████████████████████████████████| 7/7 [00:00<00:00, 94.66image/s]
PNGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.15 savings: 13.2%
Output file is a PDF/A-2B (as expected)

then

ls -l
-rw-r--r--@  1 user  staff   2353187 Feb 25 10:47 Canon-iR1435-Brochure.pdf
-rw-r--r--   1 user  staff   1674421 Feb 25 10:48 opt.pdf

and 2353187/1674421 = 1.41.

If the optimization step saves 13%, how does the file become even smaller (a ratio of 1.41, i.e. overall savings of about 29%)? Is there further optimization afterwards?

homocomputeris commented 1 year ago

Okay, it does become smaller for some reason. But what happens in this case?

ocrmypdf --tesseract-timeout=0 --optimize 0 --skip-text Canon-iR1435-Brochure.pdf opt.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.58page/s]
Start processing 4 pages concurrently
    1 skipping all processing on this page                                                                    
    2 skipping all processing on this page                                                                    
    3 skipping all processing on this page                                                                    
    4 skipping all processing on this page                                                                    
Image processing: 100%|█████████████████████████████████████████████████| 4.0/4.0 [00:00<00:00, 1407.96page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.56page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

then

ls -l
-rw-r--r--@  1 user  staff   2353187 Feb 25 10:47 Canon-iR1435-Brochure.pdf
-rw-r--r--   1 user  staff   1932656 Feb 25 11:02 opt.pdf
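
The three file sizes reported for the Canon brochure can be split into per-stage ratios, in the spirit of the "whole picture" report suggested earlier in the thread. This is only a sketch: it assumes the `--optimize 0` output is a reasonable stand-in for the intermediate file that the optimizer receives in the `--optimize 2` run, which the thread does not confirm, and the helper names are illustrative.

```python
def ratio(before: int, after: int) -> float:
    """Size before a stage divided by size after it."""
    return before / after


def savings(before: int, after: int) -> float:
    """Fraction of the 'before' size removed by a stage."""
    return 1.0 - after / before


# Sizes in bytes, copied from the `ls -l` listings in this thread.
input_size = 2353187   # Canon-iR1435-Brochure.pdf
unoptimized = 1932656  # opt.pdf produced with --optimize 0
optimized = 1674421    # opt.pdf produced with --optimize 2

conversion = ratio(input_size, unoptimized)   # PDF/A conversion only
optimization = ratio(unoptimized, optimized)  # optimization step only
overall = ratio(input_size, optimized)        # whole pipeline

print(f"PDF/A conversion ratio: {conversion:.2f}")    # 1.22
print(f"Optimization ratio:     {optimization:.2f}")  # 1.15
print(f"Overall ratio:          {overall:.2f}")       # 1.41
print(f"Overall savings:        {savings(input_size, optimized):.1%}")
```

Under that assumption the stage ratios multiply to the overall ratio (about 1.22 x 1.15 = 1.41), so the reported "Optimize ratio: 1.15" would describe only the last stage, while the rest of the reduction happens during PDF/A conversion.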

macdeport commented 1 year ago

@homocomputeris The difference in visual content between the input and output PDFs explains the reduction in size.

But that is the main problem...

Input screenshot-153159

Output screenshot-153148

homocomputeris commented 1 year ago

> @homocomputeris Different visual content of input / output PDF explain the compression in size.

But in my 2nd example I set --optimize 0. Why does the quality get worse?

florisre commented 8 months ago

On a related note, the file size ratio seems to be inverted. For example, I get "Total file size ratio: 0.21 savings: -374.0%", which does not match. Savings of more than 100% are mathematically impossible; it seems to be calculated as $\frac{\text{old size}}{\text{new size}}$ instead of $1 - \frac{\text{new size}}{\text{old size}}$.

Unfortunately, I am not able to check whether this is fixed in the latest release, as I am on Fedora and only have access to version 15.4.3 through its package sources. Thus, I decided to hijack this issue. If someone confirms that this remains an issue in the latest version, I will open a separate issue.
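
One way to sanity-check a reported pair is to test whether the ratio and the savings figure agree with each other. The sketch below assumes the definitions discussed in this thread (ratio = old size / new size, savings = 1 - new size / old size), under which savings = 1 - 1/ratio. The function names are hypothetical, and the check says nothing about which files OCRmyPDF actually compared.

```python
def savings_from_ratio(ratio: float) -> float:
    """Savings implied by ratio = old/new when savings = 1 - new/old."""
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Inverse relation: the ratio implied by a savings fraction."""
    return 1.0 / (1.0 - savings)


# (ratio, savings) pairs as printed in the logs quoted in this thread.
reported = [(1.11, 0.100), (1.15, 0.132), (0.21, -3.740)]

for shown_ratio, shown_savings in reported:
    implied = ratio_from_savings(shown_savings)
    print(f"reported ratio {shown_ratio:.2f}, implied by savings {implied:.2f}")
```

Under these assumed definitions all three pairs are internally consistent after rounding. A ratio of 0.21 with savings of -374.0% would then correspond to an output roughly 4.7 times larger than the baseline. Whether that matches the actual files in the 15.4.3 case cannot be determined from the thread, because no file sizes were given for it.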