ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Images with "image" encoding do not get compressed #998

Closed stepelu closed 2 years ago

stepelu commented 2 years ago

Describe the bug

Some images, which according to pdfimages -list are encoded as:

raster image (may be Flate or LZW compressed but does not use an image encoding)

do not get compressed.

To Reproduce

Consider the PDF file out.pdf, which contains 4 grayscale images on 2 pages.

(py3env) ~/D/pdf2 ❯❯❯ pdfimages -list out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    3675  2775  gray    1   8  image  no         3  0   600   600  135K 1.4%
   1     1 image    3675  2775  gray    1   8  image  no         2  0   600   600  129K 1.3%
   2     2 image    3675  2775  gray    1   8  image  no        14  0   600   600  135K 1.4%
   2     3 image    3675  2775  gray    1   8  image  no         8  0   600   600 87.0K 0.9%

Running ocrmypdf --jbig2-lossy out.pdf ocr-out.pdf does not result in a file with compressed images:

(py3env) ~/D/pdf2 ❯❯❯ pdfimages -list ocr-out.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    3675  2775  gray    1   8  image  no        14  0   600   600  135K 1.4%
   1     1 image    3675  2775  gray    1   8  image  no        15  0   600   600  129K 1.3%
   2     2 image    3675  2775  gray    1   8  image  no        20  0   600   600  135K 1.4%
   2     3 image    3675  2775  gray    1   8  image  no        21  0   600   600 87.0K 0.9%
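
The two listings can also be compared mechanically rather than by eye. The following is a minimal sketch, assuming the 16-column layout shown above (the "object ID" header covers two values per row); the function names are hypothetical and not part of pdfimages or OCRmyPDF.

```python
def parse_pdfimages_list(text):
    """Parse `pdfimages -list` output into a list of dicts, one per image."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 16 whitespace-separated fields and start with a
        # page number; the header, separator, and prompt lines are skipped.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "enc": fields[8],
            "object": int(fields[10]),
            "size": fields[14],
        })
    return rows


def unchanged_images(before, after):
    """Return image numbers whose encoding and size match in both listings."""
    after_by_num = {row["num"]: row for row in after}
    result = []
    for row in before:
        other = after_by_num.get(row["num"])
        if other and (row["enc"], row["size"]) == (other["enc"], other["size"]):
            result.append(row["num"])
    return result
```

Feeding it the two listings from this report would return all four image numbers, since only the object IDs differ between the input and the output.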

The log of ocrmypdf with -v 1 is:

(py3env) ~/D/pdf2 ❯❯❯ ocrmypdf -v 1 --jbig2-lossy out.pdf ocr-out.pdf
ocrmypdf 13.6.0
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/opt/homebrew/share/tessdata/" (3):
eng
osd
snum

Running: ['tesseract', '--version']
Found tesseract 5.2.0
Running: ['gs', '--version']
Found gs 9.56.1
os.symlink(out.pdf, /var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/origin)
os.symlink(/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/origin, /var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/origin.pdf)
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 314.52page/s]
Using Tesseract OpenMP thread limit 3
Start processing 2 pages concurrently
    1 Rasterize with pnggray, rotation 0
    2 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r600.000000x600.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/origin.pdf']
    2 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=2', '-dLastPage=2', '-r600.000000x600.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/origin.pdf']
    2 Rotating output by 0
    1 Rotating output by 0
    2 resolution (599.9988, 599.9988)
    1 resolution (599.9988, 599.9988)
    2 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/000002_ocr.png', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/000002_ocr_tess', 'pdf', 'txt']
    1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/000001_ocr.png', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/000001_ocr_tess', 'pdf', 'txt']
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    2 Grafting
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
OCR: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.0/2.0 [00:03<00:00,  1.61s/page]
Postprocessing...
os.symlink(/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/graft_layers.pdf, /var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/fix_docinfo.pdf', '/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/pdfa.ps']
GPL Ghostscript 9.56.1 (2022-04-04)
Copyright (C) 2022 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 2.
Page 1
Page 2
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.62page/s]
xref 21: treating as an optimization candidate
xref 22: treating as an optimization candidate
xref 25: treating as an optimization candidate
xref 24: treating as an optimization candidate
xref 24: skipping image with Decode table
xref 25: skipping image with Decode table
xref 21: skipping image with Decode table
xref 22: skipping image with Decode table
Optimizable images: JPEGs: 0 PNGs: 0
Recompressing JPEGs: 0image [00:00, ?image/s]
xref 21: treating as an optimization candidate
xref 22: treating as an optimization candidate
xref 25: treating as an optimization candidate
xref 24: treating as an optimization candidate
xref 24: skipping image with Decode table
xref 25: skipping image with Decode table
xref 21: skipping image with Decode table
xref 22: skipping image with Decode table
Deflating JPEGs: 0image [00:00, ?image/s]
xref 21: treating as an optimization candidate
xref 22: treating as an optimization candidate
xref 25: treating as an optimization candidate
xref 24: treating as an optimization candidate
xref 24: skipping image with Decode table
xref 25: skipping image with Decode table
xref 21: skipping image with Decode table
xref 22: skipping image with Decode table
Optimizable images: JBIG2 groups: 0
JBIG2: 0item [00:00, ?item/s]
os.symlink(/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/optimize.opt.pdf, /var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/optimize.pdf)
Running: ['jbig2', '--version']
Running: ['pngquant', '--version']
Optimize ratio: 1.00 savings: 0.0%
/var/folders/bb/mchzrbvs73n42w10_drjxvvr0000gn/T/ocrmypdf.io.qzjlul5f/optimize.pdf -> ocr-out.pdf
Output file is a PDF/A-2B (as expected)

It seems that these images are not considered as candidates for JBIG2 compression.
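
To see at a glance which objects the optimizer considered and which it dropped, the verbose log can be summarized with a short script. This is only a sketch: the two message patterns are copied from the ocrmypdf 13.6.0 log above and may differ in other versions, and `summarize_optimizer_log` is a hypothetical helper name.

```python
import re

# Message formats as they appear in the log above (ocrmypdf 13.6.0).
CANDIDATE_RE = re.compile(r"xref (\d+): treating as an optimization candidate")
SKIP_RE = re.compile(r"xref (\d+): skipping image with Decode table")


def summarize_optimizer_log(log_text):
    """Collect xref numbers that were candidates and those skipped."""
    candidates = set()
    skipped = set()
    for line in log_text.splitlines():
        match = CANDIDATE_RE.search(line)
        if match:
            candidates.add(int(match.group(1)))
            continue
        match = SKIP_RE.search(line)
        if match:
            skipped.add(int(match.group(1)))
    return {
        "candidates": sorted(candidates),
        "skipped_decode": sorted(skipped),
        # Candidates that were not rejected because of a Decode table.
        "remaining": sorted(candidates - skipped),
    }
```

For the log in this report, every candidate (xref 21, 22, 24, 25) also appears in the skipped set, so nothing remains to optimize, which matches the "Optimize ratio: 1.00" line.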

Example file

See above.

Expected behavior

The images are grayscale, so with the default optimization level 1 they should get compressed to lossy JBIG2 (given the passed options). This happens, for instance, for images that pdfimages -list reports as ccitt-encoded; that case works fine.

System

jbarlow83 commented 2 years ago

According to the logs, these images have a Decode table, which works like a color filter applied to the image data. The basic use case is an optimized monochrome image that is used to render color.
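
As an illustration of what a Decode array does, the PDF specification maps each raw sample linearly onto the range given by a pair of Decode values. The sketch below applies that mapping to a single component; the function name is hypothetical, and the actual Decode values in out.pdf are not shown in this thread.

```python
def apply_decode(sample, decode_min, decode_max, bits_per_component=8):
    """Map a raw image sample to a color component value.

    Linear interpolation from the sample range [0, 2**bpc - 1] onto
    [decode_min, decode_max], as described for image Decode arrays
    in the PDF specification.
    """
    max_sample = (1 << bits_per_component) - 1
    return decode_min + sample * (decode_max - decode_min) / max_sample


# Decode [0 1] leaves an 8-bit gray image unchanged:
# sample 0 maps to 0.0 and sample 255 maps to 1.0.
# Decode [1 0] inverts it: sample 0 maps to 1.0 and sample 255 maps to 0.0.
```

Because the same sample bytes can render differently depending on this array, recompressing the image data without accounting for it could change the page's appearance.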

OCRmyPDF skips such images (and, in fact, any image that doesn't look simple and safe, because it is an archival tool): they are uncommon, and modifying them without changing their appearance involves significant complexity.

I might be able to whitelist this particular case if the images are "simple" enough.
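
One possible reading of "simple" here, offered purely as an assumption and not as OCRmyPDF's actual logic, is a Decode array that equals the default for the color space, so that it has no visible effect. A minimal sketch for DeviceGray, DeviceRGB, and DeviceCMYK images (whose default is `[0 1]` per component); `is_identity_decode` is a hypothetical name:

```python
def is_identity_decode(decode, n_components):
    """Return True if `decode` is the default [0 1 0 1 ...] array.

    Assumes a DeviceGray (1), DeviceRGB (3), or DeviceCMYK (4) image.
    Other color spaces such as Indexed have different defaults and are
    not handled by this sketch.
    """
    expected = [0.0, 1.0] * n_components
    if len(decode) != len(expected):
        return False
    return all(abs(float(a) - b) < 1e-9 for a, b in zip(decode, expected))
```

An image passing such a check could in principle be treated as if it had no Decode array at all; an inverted array such as `[1 0]` would not pass and would still need special handling.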

stepelu commented 2 years ago

I understand. In the meantime, I found another procedure that avoids creating these images with the Decode table (they originated from running pdftops for rasterization, followed by ps2pdf to get a PDF back). So I am closing the ticket. Thanks for the prompt reply!