pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0

/Type /Xobject (with small o) images not extracted/optimized #133

Closed rbrito closed 1 year ago

rbrito commented 4 years ago

Dear @pts,

I found a file that has CCITT images that the latest pdfsizeopt doesn't extract/optimize.

Here is the output of running pdfsizeopt on that file:

$ python pdfsizeopt.single --use-multivalent=no --do-optimize-images=yes orlin.pdf 
info: This is pdfsizeopt ZIP rUNKNOWN size=69734.
info: prepending to PATH: /tmp
info: loading PDF from: orlin.pdf
info: loaded PDF of 1987849 bytes
info: using Ghostscript TMPDIR=. TEMP=. gs: GPL Ghostscript 9.22 (2017-10-04)
info: decompressing 251 bytes with Ghostscript /Filter/FlateDecode/DecodeParms <</Predictor 12/Columns 5>>
info: found 136 obj offsets and 1 obj streams in xref stream
info: separated to 134 objs + xref + trailer
info: parsed 134 objs
info: found 0 Type1 fonts loaded
info: found 0 Type1C fonts loaded
info: optimized 29 streams, kept 29 #orig
info: compressed 0 streams, kept 0 of them uncompressed
info: saving PDF with 134 objs to: orlin.pso.pdf
info: generated object stream of 1067 bytes in 76 objects (6%)
info: generated 1987849 bytes (100%)

Here are the images that are contained in that file:

$ pdfimages -list orlin.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    3424  4416  gray    1   1  ccitt  no        78  0   401   400 5252B 0.3%
   2     1 image    3424  4416  gray    1   1  ccitt  no        80  0   401   400 6279B 0.3%
   3     2 image    3424  4416  gray    1   1  ccitt  no        82  0   401   400 44.7K 2.4%
   4     3 image    3456  4448  gray    1   1  ccitt  no        84  0   400   400 97.6K 5.2%
   5     4 image    3456  4448  gray    1   1  ccitt  no        86  0   400   400 60.8K 3.2%
   6     5 image    3424  4416  gray    1   1  ccitt  no        88  0   401   400 84.7K 4.6%
   7     6 image    3424  4416  gray    1   1  ccitt  no        90  0   401   400 70.8K 3.8%
   8     7 image    3424  4416  gray    1   1  ccitt  no        92  0   401   400 62.8K 3.4%
   9     8 image    3424  4416  gray    1   1  ccitt  no        94  0   401   400 67.3K 3.6%
  10     9 image    3424  4416  gray    1   1  ccitt  no        96  0   401   400 59.4K 3.2%
  11    10 image    3424  4416  gray    1   1  ccitt  no        98  0   401   400 63.9K 3.5%
  12    11 image    3424  4416  gray    1   1  ccitt  no       100  0   401   400 51.7K 2.8%
  13    12 image    3424  4416  gray    1   1  ccitt  no       102  0   401   400 51.1K 2.8%
  14    13 image    3424  4416  gray    1   1  ccitt  no       104  0   401   400 50.5K 2.7%
  15    14 image    3424  4416  gray    1   1  ccitt  no       106  0   401   400 63.2K 3.4%
  16    15 image    3424  4416  gray    1   1  ccitt  no       108  0   401   400 54.2K 2.9%
  17    16 image    3424  4416  gray    1   1  ccitt  no       110  0   401   400 47.6K 2.6%
  18    17 image    3424  4416  gray    1   1  ccitt  no       112  0   401   400 44.6K 2.4%
  19    18 image    3424  4416  gray    1   1  ccitt  no       114  0   401   400 93.4K 5.1%
  20    19 image    3424  4416  gray    1   1  ccitt  no       116  0   401   400 79.3K 4.3%
  21    20 image    3424  4416  gray    1   1  ccitt  no       118  0   401   400 86.6K 4.7%
  22    21 image    3456  4448  gray    1   1  ccitt  no       120  0   400   400 72.8K 3.9%
  23    22 image    3424  4416  gray    1   1  ccitt  no       122  0   401   400 81.9K 4.4%
  24    23 image    3424  4416  gray    1   1  ccitt  no       124  0   401   400 88.6K 4.8%
  25    24 image    3424  4416  gray    1   1  ccitt  no       126  0   401   400 83.6K 4.5%
  26    25 image    3424  4416  gray    1   1  ccitt  no       128  0   401   400 73.9K 4.0%
  27    26 image    3424  4416  gray    1   1  ccitt  no       130  0   401   400 80.0K 4.3%
  28    27 image    3424  4416  gray    1   1  ccitt  no       132  0   401   400 84.4K 4.6%
  29    28 image    3424  4416  gray    1   1  ccitt  no       134  0   401   400 30.7K 1.7%

I'm attaching the file in question here.

orlin.pdf

Thanks,

Rogério Brito.

rbrito commented 4 years ago

I have not read the specs (so I'm not sure whether what I'm describing is "legal" or "illegal"), but the CCITT images in the file above have /Type /Xobject in their dictionaries (with a lowercase o, instead of /XObject). Removing those key-value pairs makes pdfsizeopt compress the corresponding images as JBIG2.
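The workaround described above could be sketched roughly as follows. This is only an illustration, not the hack linked in this comment and not pdfsizeopt's own code: `normalize_xobject_type` is a hypothetical helper, and it assumes the object's dictionary is available as a byte string. Instead of removing the key-value pair, it rewrites the misspelled name. Being a plain regular-expression rewrite, it would also touch a matching sequence inside a PDF string literal, which a real parser would avoid.

```python
import re

# Hypothetical helper, not part of pdfsizeopt.
# Matches /Type followed by the misspelled name /Xobject. The lookahead
# requires a PDF delimiter, whitespace, or end of input after the name,
# so longer names such as /XobjectFoo are left alone.
_BAD_TYPE_RE = re.compile(br'/Type\s*/Xobject(?=[\s/<>\[\](){}%]|\Z)')


def normalize_xobject_type(obj_head):
    """Return obj_head (bytes) with /Type /Xobject rewritten to /Type/XObject."""
    return _BAD_TYPE_RE.sub(b'/Type/XObject', obj_head)


if __name__ == '__main__':
    head = b'<</Type /Xobject/Subtype/Image/Width 3424/Height 4416>>'
    print(normalize_xobject_type(head))
```

Rewriting rather than deleting keeps the dictionary otherwise unchanged, so the output stays close to the input file.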

A very ugly (but working) hack that you can cherry-pick is currently at: https://github.com/rbrito/pdfsizeopt/commit/5dfacc75ab39e263885db3de249d20676989baed

Regards,

Rogério Brito.

zvezdochiot commented 4 years ago

Please add the output of:

pdfinfo orlin.pdf

for a complete report.

rbrito commented 4 years ago

> Please add the output of:
>
> pdfinfo orlin.pdf
>
> for a complete report.

I don't know if you are asking me to provide this (the file is readily available here in my original report), but here you go:

$ pdfinfo orlin.pdf 
Creator:         
Producer:        
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          29
Encrypted:      no
Page size:      616.32 x 794.88 pts
Page rot:       0
File size:      1987849 bytes
Optimized:      no
PDF version:    1.5

And just to be complete, here is the output of pdfimages before and after a run with my patched version of pdfsizeopt:

$ pdfimages -list orlin.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    3424  4416  gray    1   1  ccitt  no        78  0   401   400 5252B 0.3%
   2     1 image    3424  4416  gray    1   1  ccitt  no        80  0   401   400 6279B 0.3%
   3     2 image    3424  4416  gray    1   1  ccitt  no        82  0   401   400 44.7K 2.4%
   4     3 image    3456  4448  gray    1   1  ccitt  no        84  0   400   400 97.6K 5.2%
   5     4 image    3456  4448  gray    1   1  ccitt  no        86  0   400   400 60.8K 3.2%
   6     5 image    3424  4416  gray    1   1  ccitt  no        88  0   401   400 84.7K 4.6%
   7     6 image    3424  4416  gray    1   1  ccitt  no        90  0   401   400 70.8K 3.8%
   8     7 image    3424  4416  gray    1   1  ccitt  no        92  0   401   400 62.8K 3.4%
   9     8 image    3424  4416  gray    1   1  ccitt  no        94  0   401   400 67.3K 3.6%
  10     9 image    3424  4416  gray    1   1  ccitt  no        96  0   401   400 59.4K 3.2%
  11    10 image    3424  4416  gray    1   1  ccitt  no        98  0   401   400 63.9K 3.5%
  12    11 image    3424  4416  gray    1   1  ccitt  no       100  0   401   400 51.7K 2.8%
  13    12 image    3424  4416  gray    1   1  ccitt  no       102  0   401   400 51.1K 2.8%
  14    13 image    3424  4416  gray    1   1  ccitt  no       104  0   401   400 50.5K 2.7%
  15    14 image    3424  4416  gray    1   1  ccitt  no       106  0   401   400 63.2K 3.4%
  16    15 image    3424  4416  gray    1   1  ccitt  no       108  0   401   400 54.2K 2.9%
  17    16 image    3424  4416  gray    1   1  ccitt  no       110  0   401   400 47.6K 2.6%
  18    17 image    3424  4416  gray    1   1  ccitt  no       112  0   401   400 44.6K 2.4%
  19    18 image    3424  4416  gray    1   1  ccitt  no       114  0   401   400 93.4K 5.1%
  20    19 image    3424  4416  gray    1   1  ccitt  no       116  0   401   400 79.3K 4.3%
  21    20 image    3424  4416  gray    1   1  ccitt  no       118  0   401   400 86.6K 4.7%
  22    21 image    3456  4448  gray    1   1  ccitt  no       120  0   400   400 72.8K 3.9%
  23    22 image    3424  4416  gray    1   1  ccitt  no       122  0   401   400 81.9K 4.4%
  24    23 image    3424  4416  gray    1   1  ccitt  no       124  0   401   400 88.6K 4.8%
  25    24 image    3424  4416  gray    1   1  ccitt  no       126  0   401   400 83.6K 4.5%
  26    25 image    3424  4416  gray    1   1  ccitt  no       128  0   401   400 73.9K 4.0%
  27    26 image    3424  4416  gray    1   1  ccitt  no       130  0   401   400 80.0K 4.3%
  28    27 image    3424  4416  gray    1   1  ccitt  no       132  0   401   400 84.4K 4.6%
  29    28 image    3424  4416  gray    1   1  ccitt  no       134  0   401   400 30.7K 1.7%
$ pdfimages -list orlin.pso.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    3424  4416  gray    1   1  jbig2  no        78  0   401   400 4060B 0.2%
   2     1 image    3424  4416  gray    1   1  jbig2  no        80  0   401   400 4992B 0.3%
   3     2 image    3424  4416  gray    1   1  jbig2  no        82  0   401   400 34.2K 1.9%
   4     3 image    3456  4448  gray    1   1  jbig2  no        84  0   400   400 73.1K 3.9%
   5     4 image    3456  4448  gray    1   1  jbig2  no        86  0   400   400 46.3K 2.5%
   6     5 image    3424  4416  gray    1   1  jbig2  no        88  0   401   400 64.1K 3.5%
   7     6 image    3424  4416  gray    1   1  jbig2  no        90  0   401   400 53.7K 2.9%
   8     7 image    3424  4416  gray    1   1  jbig2  no        92  0   401   400 47.5K 2.6%
   9     8 image    3424  4416  gray    1   1  jbig2  no        94  0   401   400 50.7K 2.7%
  10     9 image    3424  4416  gray    1   1  jbig2  no        96  0   401   400 45.1K 2.4%
  11    10 image    3424  4416  gray    1   1  jbig2  no        98  0   401   400 47.1K 2.6%
  12    11 image    3424  4416  gray    1   1  jbig2  no       100  0   401   400 38.9K 2.1%
  13    12 image    3424  4416  gray    1   1  jbig2  no       102  0   401   400 38.0K 2.1%
  14    13 image    3424  4416  gray    1   1  jbig2  no       104  0   401   400 38.5K 2.1%
  15    14 image    3424  4416  gray    1   1  jbig2  no       106  0   401   400 46.8K 2.5%
  16    15 image    3424  4416  gray    1   1  jbig2  no       108  0   401   400 40.0K 2.2%
  17    16 image    3424  4416  gray    1   1  jbig2  no       110  0   401   400 35.3K 1.9%
  18    17 image    3424  4416  gray    1   1  jbig2  no       112  0   401   400 34.0K 1.8%
  19    18 image    3424  4416  gray    1   1  jbig2  no       114  0   401   400 69.9K 3.8%
  20    19 image    3424  4416  gray    1   1  jbig2  no       116  0   401   400 59.8K 3.2%
  21    20 image    3424  4416  gray    1   1  jbig2  no       118  0   401   400 65.0K 3.5%
  22    21 image    3456  4448  gray    1   1  jbig2  no       120  0   400   400 54.3K 2.9%
  23    22 image    3424  4416  gray    1   1  jbig2  no       122  0   401   400 62.4K 3.4%
  24    23 image    3424  4416  gray    1   1  jbig2  no       124  0   401   400 67.3K 3.6%
  25    24 image    3424  4416  gray    1   1  jbig2  no       126  0   401   400 62.1K 3.4%
  26    25 image    3424  4416  gray    1   1  jbig2  no       128  0   401   400 55.0K 3.0%
  27    26 image    3424  4416  gray    1   1  jbig2  no       130  0   401   400 60.3K 3.3%
  28    27 image    3424  4416  gray    1   1  jbig2  no       132  0   401   400 64.2K 3.5%
  29    28 image    3424  4416  gray    1   1  jbig2  no       134  0   401   400 23.2K 1.3%
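To compare the two listings above without reading every row, the `pdfimages -list` output can be tallied per encoding. The sketch below is a rough illustration, not part of pdfsizeopt or poppler: `summarize` and `parse_size` are hypothetical names, the column positions are taken from the header shown above, and the size suffixes are assumed to be multiples of 1024 bytes. Because the listed sizes are rounded, the totals are approximate.

```python
# Hypothetical helpers for summarizing `pdfimages -list` output.
# Assumption: the K/M/G size suffixes denote multiples of 1024 bytes.
_UNITS = {'B': 1, 'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}


def parse_size(token):
    """Convert a size token such as '5252B' or '44.7K' to bytes (float)."""
    return float(token[:-1]) * _UNITS[token[-1]]


def summarize(listing):
    """Return {encoding: [image_count, approx_total_bytes]} for a listing."""
    summary = {}
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have 16 columns and start with the page number;
        # this skips the header row and the dashed separator.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        enc, size = fields[8], parse_size(fields[14])
        entry = summary.setdefault(enc, [0, 0.0])
        entry[0] += 1
        entry[1] += size
    return summary


if __name__ == '__main__':
    import sys
    # Example: pdfimages -list orlin.pdf | python summarize_images.py
    for enc, (count, total) in sorted(summarize(sys.stdin.read()).items()):
        print('%s: %d images, about %d bytes' % (enc, count, total))
```

Running it on each listing in turn gives one line per encoding, which makes the ccitt versus jbig2 totals easy to compare.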

zvezdochiot commented 4 years ago

@rbrito, please enable Issues (under Settings) in your fork.

zvezdochiot commented 4 years ago

@rbrito, hello.

There is a PDF in which the illustrations lack /Type/XObject, but /Subtype /Image is present. Can this be used?
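For context: in the PDF specification, /Type is optional in an image XObject dictionary, while /Subtype /Image is required. A detection rule that covers both this case and the /Xobject misspelling could look like the sketch below. It is only an illustration under stated assumptions: `looks_like_image_xobject` is a hypothetical function, the dictionary is assumed to be already parsed into a Python dict of name strings, and the case-insensitive match on /Type is a leniency chosen here, not necessarily what pdfsizeopt does.

```python
def looks_like_image_xobject(obj_dict):
    """Decide whether a parsed PDF dictionary describes an image XObject.

    obj_dict is assumed to map key names (without the leading slash) to
    raw value strings, e.g. {'Type': '/XObject', 'Subtype': '/Image'}.
    """
    # /Subtype /Image is the required marker for image XObjects.
    if obj_dict.get('Subtype') != '/Image':
        return False
    type_value = obj_dict.get('Type')
    # /Type may be absent. As an assumption of this sketch, case variants
    # such as /Xobject are also tolerated.
    return type_value is None or type_value.lower() == '/xobject'


if __name__ == '__main__':
    print(looks_like_image_xobject({'Subtype': '/Image'}))
    print(looks_like_image_xobject({'Type': '/Xobject', 'Subtype': '/Image'}))
    print(looks_like_image_xobject({'Type': '/Font', 'Subtype': '/Image'}))
```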

rbrito commented 4 years ago

Hi, @zvezdochiot.

> @rbrito, please enable Issues (under Settings) in your fork.

Just did that. Please feel free to submit things there. That said, if you can, please try to reproduce the problem with @pts's version before you file a bug in my repository. The idea is that issues filed in my repository should concern only my own changes, not problems present in every copy.

Regarding your second question, I don't have the code here, but I would like to see one such PDF file and see what to do.

Regards,

Rogério Brito.

zvezdochiot commented 4 years ago

@rbrito said:

> I would like to see one such PDF file and see what to do.

https://cyberleninka.ru/article/n/metod-segmentatsii-izobrazheniya-dlya-raspoznavaniya-pechatnyh-dokumentov/pdf

pts commented 1 year ago

Fixed in 88263ef67fbd7478cf4ed29708f49043343a0fa5.

@zvezdochiot, please file a separate issue if you have a PDF file which pdfsizeopt breaks.