AssertionError on PDF with CCITT compressed content

hannob commented 6 years ago

The attached PDF causes an AssertionError in pdfsizeopt. The file was produced by a Brother scanner (DCP-135C) and appears to use CCITT-encoded image data (black-and-white images). Attachment: pdf-blank-page.zip

With the latest pdfsizeopt I get an error like this:


info: This is pdfsizeopt ZIP rUNKNOWN size=68556.
info: prepending to PATH: /tmp/pdfsizeopt/pdfsizeopt_libexec
info: loading PDF from: pdf-blank-page.pdf
info: loaded PDF of 10555 bytes
info: separated to 8 objs + xref + trailer
info: parsed 8 objs
info: eliminated 2 unused objs, depth=6
info: found 0 Type1 fonts loaded
info: found 0 Type1C fonts loaded
info: will optimize image XObject 3; orig width=1728 height=2236 colorspace=/DeviceGray bpc=1 inv=False filter=/CCITTFaxDecode dp=1 size=9676 gs_device=pngmono
info: optimizing 1 images of 9676 bytes in total
info: writing ImageRenderer (9700 image bytes) to: psotmp.7665.conv.pngmono.tmp.ps
info: using Ghostscript TMPDIR=. TEMP=. gs: GPL Ghostscript 9.05 (2012-02-08)
info: executing ImageRenderer with Ghostscript: TMPDIR=. TEMP=. gs -q -P- -dNOPAUSE -dBATCH -sDEVICE=pngmono -sOutputFile='psotmp.7665.img-%04d.pngmono.tmp.png' -f psotmp.7665.conv.pngmono.tmp.ps
ImageRenderer: rendering image XObject 3 width=1728 height=2236 bpc=1 colorspace=/DeviceGray filter=/CCITTFaxDecode decodeparms=<< /Columns 1728 /EncodedByteAlign true /K 0 >> device=pngmono
ImageRenderer: all OK
info: loading image from: psotmp.7665.img-0001.pngmono.tmp.png
info: loaded PNG IDAT of 3832 bytes
Traceback (most recent call last):
  File "/proc/self/exe/runpy.py", line 162, in _run_module_as_main
  File "/proc/self/exe/runpy.py", line 72, in _run_code
  File "./pdfsizeopt.single/__main__.py", line 1, in <module>
  File "./pdfsizeopt.single/m.py", line 6, in <module>
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 5507, in main
  File "./pdfsizeopt.single/pdfsizeopt/main.py", line 4261, in OptimizeImages
AssertionError

pts commented 6 years ago

Thank you for reporting this bug! Fixed in 69c6b22728f2e3c1fe33b26fcee189167bbcd6d4. Upgrade to the latest pdfsizeopt.

The bug was triggered when an image object had its /Width or /Height given as an indirect reference (... ... R) rather than as a direct integer.
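To illustrate the kind of input involved, here is a minimal sketch of resolving an image dimension that may be either a direct integer or an indirect reference. It is not the code from the fix commit: `resolve_int`, the `objs` mapping (object number to serialized body), and the dict-based image model are all hypothetical names chosen for the example.

```python
import re

# Matches a PDF indirect reference such as "12 0 R".
_REF_RE = re.compile(r"^(\d+)\s+(\d+)\s+R$")


def resolve_int(value, objs):
    """Return value as an int, following one indirect reference if needed.

    objs is a hypothetical dict mapping object number -> serialized object
    body (a str). This is an assumption for illustration, not pdfsizeopt's
    internal data model.
    """
    if isinstance(value, int):
        return value  # Already a direct integer.
    text = value.strip()
    match = _REF_RE.match(text)
    if match:
        # Indirect reference: look up the target object's body.
        text = objs[int(match.group(1))].strip()
    return int(text)  # Raises ValueError if the target is not an integer.


# Toy data: /Width is stored indirectly in object 7, /Height is direct.
objs = {7: "1728", 8: "2236"}
image_dict = {"/Width": "7 0 R", "/Height": 2236}
width = resolve_int(image_dict["/Width"], objs)
height = resolve_int(image_dict["/Height"], objs)
```

Code that assumes the dimension is always a plain integer would trip over the `"7 0 R"` form, which is the situation described above.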

rbrito commented 5 years ago

Ah, great, @pts. This means that, at least in part, pdfsizeopt is able to inline objects like integers (one step closer to what Multivalent does, right?).

pts commented 5 years ago

pdfsizeopt inlines objects in some cases (e.g. `/Length` values, image `/Width` and `/Height` values, and other image values used by image recompression).

It would be awesome to add an optimization which inlines every non-stream object which is used only once (or twice... if it's small enough), restricted to those object types which the PDF specification allows to be inlined. Writing the Python code for the inlining is relatively easy, but compiling a full and reliable list of what is allowed to be inlined is the tedious part. You may want to file a separate issue for this.
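A rough sketch of what such an inlining pass could look like, under heavy assumptions: objects are modeled as a dict from object number to serialized non-stream body, all generation numbers are 0, references are found with a regex (so string literals and stream data are ignored), and only plain integers used exactly once are treated as inlinable. That last rule is a conservative placeholder for the full list of what the PDF specification allows, which is the tedious part mentioned above. `inline_single_use_ints` is a hypothetical name, not a pdfsizeopt function.

```python
import re
from collections import Counter

# Indirect reference with generation 0, e.g. "12 0 R" (assumption: gen is 0).
REF_RE = re.compile(r"\b(\d+)\s+0\s+R\b")
# Placeholder allowlist: only plain integers are considered inlinable here.
INT_RE = re.compile(r"^[+-]?\d+$")


def inline_single_use_ints(objs):
    """Inline integer objects that are referenced exactly once.

    objs: hypothetical dict of object number -> serialized non-stream body.
    Returns a new dict with the inlined objects removed.
    """
    # Count how many times each object number is referenced anywhere.
    counts = Counter()
    for body in objs.values():
        counts.update(int(m.group(1)) for m in REF_RE.finditer(body))

    # Pick integer objects with exactly one reference.
    inlinable = {
        num: body.strip()
        for num, body in objs.items()
        if counts[num] == 1 and INT_RE.match(body.strip())
    }

    # Rewrite the remaining objects, replacing references with the values.
    result = {}
    for num, body in objs.items():
        if num in inlinable:
            continue  # Dropped: its value now lives in the referencing object.
        result[num] = REF_RE.sub(
            lambda m: inlinable.get(int(m.group(1)), m.group(0)), body
        )
    return result


objs = {
    1: "<< /Type /XObject /Width 2 0 R /Height 3 0 R /Length 4 0 R >>",
    2: "1728",
    3: "2236",
    4: "9676",
    5: "<< /A 6 0 R /B 6 0 R >>",
    6: "42",  # Referenced twice, so this sketch leaves it alone.
}
optimized = inline_single_use_ints(objs)
```

A real implementation would need a proper PDF tokenizer and the per-key rules from the specification (some values must stay indirect, others must be direct), so this only shows the counting-and-substituting shape of the idea.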

pts / pdfsizeopt, issue #91