Optimize image data in content streams

Thank you for suggesting this feature! It would be nice to have. See my analysis below.

The p1.pdf file attached to the issue contains a content stream with a large inline image, compressed with Flate and then encoded with ASCII85. The image in the content stream looks like this:

BI 
/W 1062.00 /H 1425.00 /BPC 8 /CS /RGB /F [/A85 /Fl] 
ID 
... (lots of image data)
...~>
EI

If the image data were in an /Image object (referenced by the content stream) instead, pdfsizeopt would already optimize the image object, and that would include decoding the ASCII85 stream to binary, making it smaller.
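To give a rough idea of the ASCII85 overhead, here is a minimal sketch using only the Python standard library. The pixel data is made up for illustration, and this is not pdfsizeopt's code; it only shows that the ASCII85 layer makes the Flate-compressed bytes larger, and that removing it restores the binary stream unchanged.

```python
import base64
import zlib

# Made-up stand-in for pixel data (64x64 samples, 8 bits each).
raw = bytes((x * x + 7 * y) % 256 for y in range(64) for x in range(64))

# Compress with Flate first, then encode with ASCII85, as in [/A85 /Fl].
flate = zlib.compress(raw, 9)
# PDF ASCII85 data ends with the ~> marker and has no <~ prefix.
a85 = base64.a85encode(flate) + b"~>"

# Undo the ASCII85 layer: strip the ~> marker, then decode to binary.
decoded = base64.a85decode(a85[:-2])

print("Flate only:       %d bytes" % len(flate))
print("ASCII85 of Flate: %d bytes" % len(a85))
print("Round trip OK:    %s" % (zlib.decompress(decoded) == raw))
```

ASCII85 writes 5 characters for every 4 input bytes, so the encoded stream is about 25% larger than the binary one.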

This is what needs to be done to add optimization of image data in content streams to pdfsizeopt:

1. Adding a content stream parser which can reliably find embedded image data.
   - Detecting the beginning of the image data is easy (tokenize the content stream and search for the first BI token).
   - Detecting the end of the image data is hard. Simply searching forward for the EI token isn't reliable enough: an EI surrounded by whitespace can appear in the middle of binary image data, and it would be found too early as a false positive. Just to find the end of the image stream, we'd have to add a parser for compressed and encoded streams (at least /FlateDecode, /DCTDecode, /LZWDecode, /ASCII85Decode, /ASCIIHexDecode and /RunLengthDecode; maybe also /CCITTFaxDecode). This would be a relatively large effort, with many corner cases for each filter. For uncompressed streams we'd have to add byte counting, which is relatively easy.
2. Adding a converter which can convert image header to object header, e.g. the one above to /Width 1062 /Height 1425 /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [/ASCII85Decode /FlateDecode] and back. This is relatively easy.
3. Adding a call to the image object optimizer code. This is also easy.
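The header conversion described above could be sketched roughly as follows. This is an illustration only, not pdfsizeopt code: the function name `expand_inline_image_header` is hypothetical, and the sketch handles only names, numbers and arrays (no nested dictionaries such as /DP values, no strings, no comments). The abbreviation tables are the ones the PDF specification lists for inline images.

```python
import re

# Abbreviated inline image dictionary keys and their full names.
KEY_NAMES = {
    "BPC": "BitsPerComponent", "CS": "ColorSpace", "D": "Decode",
    "DP": "DecodeParms", "F": "Filter", "H": "Height",
    "IM": "ImageMask", "I": "Interpolate", "W": "Width",
}
# Abbreviated color space and filter names and their full names.
VALUE_NAMES = {
    "G": "DeviceGray", "RGB": "DeviceRGB", "CMYK": "DeviceCMYK",
    "I": "Indexed", "AHx": "ASCIIHexDecode", "A85": "ASCII85Decode",
    "LZW": "LZWDecode", "Fl": "FlateDecode", "RL": "RunLengthDecode",
    "CCF": "CCITTFaxDecode", "DCT": "DCTDecode",
}
# Tokens: a name, an array bracket, or anything else (e.g. a number).
TOKEN_RE = re.compile(r"/[^\s/\[\]]*|\[|\]|[^\s/\[\]]+")


def expand_inline_image_header(header):
    """Expand abbreviations in the text between BI and ID (hypothetical helper)."""
    out = []
    depth = 0          # Nesting depth of [...] arrays.
    expect_key = True  # Top-level tokens alternate between key and value.
    for token in TOKEN_RE.findall(header):
        if depth == 0 and expect_key:
            out.append("/" + KEY_NAMES.get(token[1:], token[1:]))
            expect_key = False
            continue
        if token == "[":
            depth += 1
        elif token == "]":
            depth -= 1
        elif token.startswith("/"):
            token = "/" + VALUE_NAMES.get(token[1:], token[1:])
        elif re.fullmatch(r"\d+\.0*", token):
            token = token.split(".")[0]  # 1062.00 -> 1062
        out.append(token)
        expect_key = depth == 0  # A value is complete once all arrays close.
    return " ".join(out).replace("[ ", "[").replace(" ]", "]")


print(expand_inline_image_header(
    "/W 1062.00 /H 1425.00 /BPC 8 /CS /RGB /F [/A85 /Fl]"))
# /Width 1062 /Height 1425 /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [/ASCII85Decode /FlateDecode]
```

Keys and values are told apart by position, because /I means Interpolate as a key but Indexed as a color space. Converting back would use the inverted tables in the same way.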

Because of the relatively large amount of work in the first step, this feature is unlikely to be implemented soon unless it gets funding.
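To illustrate why the first step is the expensive one, here is a small sketch with a made-up, uncompressed 4x2 grayscale inline image whose sample bytes happen to contain " EI ". The helper names are hypothetical. The naive search stops inside the image data, while byte counting (possible here only because the data is uncompressed) finds the real end.

```python
import re

# EI preceded by PDF whitespace and followed by whitespace or end of data.
NAIVE_EI_RE = re.compile(rb"[\x00\t\n\x0c\r ]EI(?=[\x00\t\n\x0c\r ]|$)")


def naive_find_ei(stream, data_start):
    """Return the offset of the first whitespace-delimited EI, or -1."""
    match = NAIVE_EI_RE.search(stream, data_start)
    return match.start() + 1 if match else -1


def uncompressed_data_size(width, height, bits_per_component, components):
    """Return the byte size of unfiltered image data (rows are byte-aligned)."""
    row_bytes = (width * components * bits_per_component + 7) // 8
    return row_bytes * height


# 8 sample bytes (4x2 pixels, 1 component, 8 bits) that contain b" EI ".
data = b"\x10 EI \x20\x30\x40"
stream = b"BI /W 4 /H 2 /BPC 8 /CS /G ID " + data + b"\nEI\nQ\n"
data_start = stream.index(b"ID ") + 3  # Data starts after ID and one whitespace.

naive_end = naive_find_ei(stream, data_start)
counted_end = data_start + uncompressed_data_size(4, 2, 8, 1)

print("naive EI offset:", naive_end)       # Falls inside the image data.
print("counted data end:", counted_end)    # The real EI follows this offset.
print("token after counted end:", stream[counted_end:].split()[0])
```

For compressed data there is no such size formula, which is why each filter would need its own decoder just to locate the end of the stream.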

Optimize image data in content streams #102