pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0

Optimize image data in content streams #102

Open rbrito opened 5 years ago

rbrito commented 5 years ago

Hi, @pts.

I recently received a very large file and, on inspecting it, found that it contained a lot of ASCII85-encoded content (the "real" content behind the ASCII85 encoding is a deflated stream).

I extracted the first page and I'm attaching it here.

It would be very nice to have the stream decoded, since that alone would surely make the file much smaller, let alone in combination with other optimizations.

Thanks,

Rogério Brito.

Attachment: p1.pdf

pts commented 5 years ago

Thank you for suggesting this feature! It would be nice to have. See my analysis below.

The p1.pdf file attached to the issue contains a content stream with a large image encoded with flate, and then with ASCII85. The image in the content stream looks like this:

BI 
/W 1062.00 /H 1425.00 /BPC 8 /CS /RGB /F [/A85 /Fl] 
ID 
... (lots of image data)
...~>
EI

If the image data were in an /Image object (referenced by the content stream) instead, pdfsizeopt would already optimize that image object, including decoding the ASCII85 stream to binary, which makes it smaller.
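To illustrate why dropping the ASCII85 layer helps, here is a minimal sketch using only the Python standard library. The sample data is a hypothetical stand-in (it is not taken from p1.pdf), and this is not pdfsizeopt code. ASCII85 represents every 4 binary bytes as 5 printable characters, so the encoded form is roughly 25% larger than the flate data it wraps.

```python
import base64
import random
import zlib

# Hypothetical stand-in for image samples (not taken from p1.pdf):
# 20000 pseudo-random bytes in the range 0..15, so flate can compress them.
rng = random.Random(0)
samples = bytes(rng.randrange(16) for _ in range(20000))

# What the inline image in the issue stores: flate first, then ASCII85.
flate_data = zlib.compress(samples, 9)
a85_data = base64.a85encode(flate_data) + b"~>"  # PDF ends ASCII85 data with ~>

# Dropping the /A85 layer: strip the ~> marker and decode back to binary.
decoded = base64.a85decode(a85_data[:-2])
assert decoded == flate_data

print("ASCII85 + flate:", len(a85_data), "bytes")
print("flate only:     ", len(decoded), "bytes")
```

The binary form is only safe to store where the stream length is known or the end can be found reliably, which is the hard part for inline images.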

This is what needs to be done to add optimization of image data in content streams to pdfsizeopt:

  1. Adding a content stream parser which can reliably find embedded image data.
    • Detecting the beginning of the image data is easy (tokenize the content stream and search for the first BI token).
    • Detecting the end of the image data is hard. (The simple idea of searching forward for the EI token isn't reliable enough: EI surrounded by whitespace can be found too early as a false positive, in the middle of binary image data.) We'd have to add a parser for compressed and encoded streams (at least /FlateDecode, /DCTDecode, /LZWDecode, /ASCII85Decode, /ASCIIHexDecode, /RunLengthDecode; and maybe also /CCITTFaxDecode), just to find the end of the image stream. This would be a relatively large effort, with many corner cases for each filter. For uncompressed streams we'd have to add byte counting (this is relatively easy).
  2. Adding a converter which can convert an inline image header to an image object header, e.g. the one above to /Width 1062 /Height 1425 /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [/ASCII85Decode /FlateDecode], and back. This is relatively easy.
  3. Adding a call to the image object optimizer code. This is also easy.
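The header conversion in step 2 could look roughly like the sketch below. It is not pdfsizeopt code: the function name expand_inline_image_header and the simplified tokenizer are assumptions made for illustration, and the tokenizer ignores strings, dictionaries, and comments. The abbreviation tables follow the inline image abbreviations defined in the PDF reference.

```python
import re

# Abbreviations allowed in inline images, per the PDF reference.
KEY_NAMES = {
    "BPC": "BitsPerComponent", "CS": "ColorSpace", "D": "Decode",
    "DP": "DecodeParms", "F": "Filter", "H": "Height",
    "IM": "ImageMask", "I": "Interpolate", "W": "Width",
}
VALUE_NAMES = {
    "ColorSpace": {"G": "DeviceGray", "RGB": "DeviceRGB",
                   "CMYK": "DeviceCMYK", "I": "Indexed"},
    "Filter": {"AHx": "ASCIIHexDecode", "A85": "ASCII85Decode",
               "LZW": "LZWDecode", "Fl": "FlateDecode",
               "RL": "RunLengthDecode", "CCF": "CCITTFaxDecode",
               "DCT": "DCTDecode"},
}

# Simplified tokenizer (an assumption): brackets, /Names, everything else.
TOKEN_RE = re.compile(r"\[|\]|/[^\s\[\]/]*|[^\s\[\]/]+")


def expand_inline_image_header(header):
    """Hypothetical helper: expands abbreviated names in an inline image header."""
    out = []
    key = None  # Full name of the key whose value is being read.
    depth = 0   # Nesting depth of [...] arrays inside the current value.
    for token in TOKEN_RE.findall(header):
        if key is None:
            if not token.startswith("/"):
                raise ValueError("expected a key name, got %r" % token)
            key = KEY_NAMES.get(token[1:], token[1:])
            out.append("/" + key)
            continue
        if token == "[":
            depth += 1
        elif token == "]":
            depth -= 1
        elif token.startswith("/"):
            name = token[1:]
            token = "/" + VALUE_NAMES.get(key, {}).get(name, name)
        elif key in ("Width", "Height"):
            token = str(int(float(token)))  # 1062.00 -> 1062
        out.append(token)
        if depth == 0:
            key = None  # Value complete; the next token is a key.
    return " ".join(out).replace("[ ", "[").replace(" ]", "]")


header = "/W 1062.00 /H 1425.00 /BPC 8 /CS /RGB /F [/A85 /Fl]"
print(expand_inline_image_header(header))
```

For the header from p1.pdf this prints the object header quoted in step 2. The reverse direction would use the same tables inverted.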

Because of the relatively large amount of work in the first step, this feature is unlikely to be implemented soon unless it gets funding.
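To make the difficulty in the first step concrete, here is a small sketch of the false positive that a naive EI search runs into, and of the byte counting that works for uncompressed data. The content stream below is a made-up 3x3 grayscale example, and find_ei_naive is a hypothetical helper, not part of pdfsizeopt.

```python
import re

# PDF whitespace characters: NUL, tab, LF, FF, CR, space.
EI_RE = re.compile(rb"[\x00\t\n\x0c\r ]EI[\x00\t\n\x0c\r ]")


def find_ei_naive(content, start):
    """Hypothetical helper: offset of the first whitespace-delimited EI."""
    match = EI_RE.search(content, start)
    return -1 if match is None else match.start() + 1


# Made-up uncompressed 3x3, 8-bit grayscale image whose 9 sample bytes
# happen to contain b" EI ".
pixels = b"\x10\x20 EI \x30\x40\x50"
content = b"BI /W 3 /H 3 /BPC 8 /CS /G ID " + pixels + b"\nEI\nQ"
data_start = content.index(b"ID ") + 3

# Naive search: stops inside the image data.
naive_end = find_ei_naive(content, data_start)

# Byte counting: rows are padded to whole bytes, so each row takes
# ceil(width * components * bits_per_component / 8) bytes.
width, height, components, bpc = 3, 3, 1, 8
row_size = (width * components * bpc + 7) // 8
counted_end = find_ei_naive(content, data_start + row_size * height)

print("naive EI offset:  ", naive_end - data_start, "bytes into the data")
print("counted EI offset:", counted_end - data_start, "bytes into the data")
```

Byte counting only works when the data length follows from the header. For filtered data the length is unknown in advance, which is why each filter would need its own end-of-data detection.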