rbrito opened 5 years ago
Thank you for suggesting this feature! It would be nice to have. See my analysis below.
The p1.pdf file attached to the issue contains a content stream with a large image encoded with flate, and then with ASCII85. The image in the content stream looks like this:
```
BI
/W 1062.00 /H 1425.00 /BPC 8 /CS /RGB /F [/A85 /Fl]
ID
... (lots of image data)
...~>
EI
```
If the image data were within an `/Image` XObject (referenced by the content stream) instead, then pdfsizeopt would already optimize that image object, including decoding the ASCII85 stream to binary, which makes it smaller.
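As a quick illustration of the saving from decoding alone (a hypothetical example, not pdfsizeopt code): ASCII85 encodes every 4 bytes of binary data as 5 ASCII characters, so merely undoing it shrinks the stream by roughly 20%, before any recompression.

```python
import base64
import zlib

# Hypothetical illustration (not pdfsizeopt code): decoding an ASCII85
# layer back to raw bytes is lossless and removes the 5/4 size overhead.
raw = zlib.compress(b"some image-like payload " * 100)   # a flate stream
encoded = base64.a85encode(raw, adobe=True)  # ASCII85 with Adobe ~> framing
decoded = base64.a85decode(encoded, adobe=True)

assert decoded == raw                  # decoding is lossless
print(len(raw), len(encoded))          # encoded is about 25% larger than raw
```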
This is what needs to be done to add optimization of image data in content streams to pdfsizeopt:

1. Finding the image data (everything between the `BI` and `EI` tokens). Unfortunately, looking for the `EI` token isn't reliable enough: `EI` surrounded by whitespace can be found too early as a false positive, in the middle of binary image data. We'd at least have to add a parser for compressed and encoded streams (at least `/FlateDecode`, `/DCTDecode`, `/LZWDecode`, `/ASCII85Decode`, `/ASCIIHexDecode`, `/RunLengthDecode`; and also maybe `/CCITTFaxDecode`), just to find the end of the image stream. This would be a relatively large effort with many corner cases for each filter. And for uncompressed streams we'd have to add byte counting (this is relatively easy).
2. Converting the abbreviated inline image parameters, e.g. `/W 1062 /H 1425 /BPC 8 /CS /RGB /F [/A85 /Fl]` to `/Width 1062 /Height 1425 /BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter [/ASCII85Decode /FlateDecode]`, and back. This is relatively easy.

Because of the relatively large amount of work in the first step, this feature is unlikely to be implemented soon unless it gets funding.
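The second step can be sketched roughly as follows (a hypothetical sketch, not actual pdfsizeopt code; it assumes the inline image dictionary between `BI` and `ID` has already been parsed into a Python dict, leaving aside the hard stream-boundary problem of step 1). The abbreviation tables come from the PDF specification; note that `/I` abbreviates `/Interpolate` as a key but `/Indexed` as a colour-space value, which is why keys and values need separate maps.

```python
# Hypothetical sketch (not actual pdfsizeopt code): expand the abbreviated
# inline-image keys and values into their image-XObject equivalents,
# following the abbreviation tables in the PDF specification.
KEY_MAP = {
    '/BPC': '/BitsPerComponent', '/CS': '/ColorSpace', '/D': '/Decode',
    '/DP': '/DecodeParms', '/F': '/Filter', '/H': '/Height',
    '/IM': '/ImageMask', '/I': '/Interpolate', '/W': '/Width',
}
VALUE_MAP = {
    '/G': '/DeviceGray', '/RGB': '/DeviceRGB', '/CMYK': '/DeviceCMYK',
    '/I': '/Indexed',  # as a value, /I means /Indexed, not /Interpolate
    '/AHx': '/ASCIIHexDecode', '/A85': '/ASCII85Decode',
    '/LZW': '/LZWDecode', '/Fl': '/FlateDecode', '/RL': '/RunLengthDecode',
    '/CCF': '/CCITTFaxDecode', '/DCT': '/DCTDecode',
}

def expand_inline_image_dict(d):
    """Expand abbreviated names in a parsed inline-image dictionary."""
    def expand(v):
        if isinstance(v, list):           # e.g. /F [/A85 /Fl]
            return [expand(x) for x in v]
        return VALUE_MAP.get(v, v) if isinstance(v, str) else v
    return {KEY_MAP.get(k, k): expand(v) for k, v in d.items()}

print(expand_inline_image_dict(
    {'/W': 1062, '/H': 1425, '/BPC': 8, '/CS': '/RGB',
     '/F': ['/A85', '/Fl']}))
```

Converting back for writing the optimized image out is just the inverted maps; the genuinely expensive part remains step 1, locating `EI` reliably.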
Hi, @pts.
I recently got a file that was very large and, upon inspecting it, it had a lot of ASCII85 encoded content (the "real" content that the ASCII85 is protecting is a deflated stream).
I extracted the first page and I'm attaching it here.
It would be very nice to have the stream decoded, since it could surely be made much smaller, just with that, let alone together with other optimizations.
Thanks,
Rogério Brito.

Attachment: p1.pdf