Code works, but pdf has the same size

SvennoNito commented 6 years ago

Hey community, I'm trying to optimize a very large PDF (>100 MB) on Windows using C:\pdfsizeopt\pdfsizeopt C:\pdfs\MA.pdf C:\pdfs\MA_optimized.pdf. It runs through, but the output file is the same size as my input file. These are the info messages that I get. What am I doing wrong?


info: This is pdfsizeopt ZIP rUNKNOWN size=68657.
info: prepending to PATH: C:\pdfsizeopt\pdfsizeopt_win32exec
info: loading PDF from: C:\pdfs\MA.pdf
info: loaded PDF of 124154212 bytes
info: separated to 989 objs + xref + trailer
info: parsed 989 objs
info: found 0 Type1 fonts loaded
info: found 0 Type1C fonts loaded
info: optimized 70 streams, kept 70 #orig
info: eliminated 66 duplicate objs
info: compressed 0 streams, kept 0 of them uncompressed
info: saving PDF with 923 objs to: C:\pdfs\MA_optimized.pdf
info: generated object stream of 4045 bytes in 141 objects (8%)
info: generated 124062005 bytes (100%)

pts commented 6 years ago

It's not exactly the same size: the output PDF is 92207 bytes smaller than the input PDF.
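The difference can be checked from the two byte counts in the log above. A minimal Python sketch (the variable names are made up for illustration):

```python
# Byte counts copied from the pdfsizeopt log in the first comment.
input_size = 124154212   # "loaded PDF of 124154212 bytes"
output_size = 124062005  # "generated 124062005 bytes (100%)"

saved = input_size - output_size
saved_percent = 100.0 * saved / input_size

print("saved %d bytes (%.2f%% of the input)" % (saved, saved_percent))
```

The saving is well below 1% of the input, which is why the log rounds the output size to 100%.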

One possible reason is that the PDF contains only vector graphics (no bitmap images, no fonts). pdfsizeopt doesn't have an optimization algorithm for vector graphics, so it just keeps them intact (except for recompressing them).

Another possible reason is that the PDF file contains lots of TrueType or OpenType fonts. Again, pdfsizeopt doesn't have an optimization algorithm for these font types, so it just keeps these fonts intact.

Another possible reason is that there is a bug in pdfsizeopt, and it doesn't notice some fonts or bitmap images that could be optimized.

To get a more accurate explanation, you may want to run pdfsizeopt --stats C:\pdfs\MA.pdf and copy-paste the output to this bug, or upload the input PDF here.

pts commented 6 years ago

A possible improvement would be adding an info message like this:

info: keeping X bytes in X content streams (vector graphics), X bytes in X TrueType/OpenType fonts, X bytes in X other objs intact

SvennoNito commented 6 years ago

Thanks pts! My PDF includes no vector graphics, only ~20 .png graphics, all <1 MB. When I run C:\pdfsizeopt\pdfsizeopt --stats C:\pdfs\MA.pdf I get:

info: This is pdfsizeopt ZIP rUNKNOWN size=68657.
info: computing statistics for PDF: C:\pdfs\MA.pdf
info: PDF size is 124154212 bytes
info: stat drawing_objs = 31707 bytes (0.03%)
info: stat font_data_objs = 0 bytes (0.00%)
info: stat footer = 26 bytes (0.00%)
info: stat header = 9 bytes (0.00%)
info: stat jpeg_image_objs = 124073372 bytes (99.93%)
info: stat linearized_xref = 0 bytes (0.00%)
info: stat nonjpeg_image_objs = 0 bytes (0.00%)
info: stat other_nonstream_objs = 29235 bytes (0.02%)
info: stat other_stream_objs = 0 bytes (0.00%)
info: stat trailer = 50 bytes (0.00%)
info: stat wasted_between_objs = 1 bytes (0.00%)
info: stat xref = 19812 bytes (0.02%)
info: end of stats

Which is interesting. I assume that jpeg_image_objs at 99.93% means that nearly all of the PDF's size comes from images?

pts commented 6 years ago

Yes, this PDF contains many JPEG images (probably each page is one big image). pdfsizeopt doesn't contain any algorithm to make JPEG images smaller, so it just copies them around.
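For longer --stats outputs, the dominant category can be picked out with a small script. The following is only a sketch: it assumes the "info: stat NAME = N bytes (P%)" line format shown in the output above, and the helper names parse_stats and largest_category are made up for illustration (they are not part of pdfsizeopt).

```python
import re

# Matches lines such as "info: stat jpeg_image_objs = 124073372 bytes (99.93%)".
STAT_RE = re.compile(r"^info: stat (\w+) = (\d+) bytes")


def parse_stats(text):
    """Return {category: byte_count} for every 'info: stat' line in text."""
    stats = {}
    for line in text.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def largest_category(stats):
    """Return (category, byte_count) for the biggest entry."""
    name = max(stats, key=stats.get)
    return name, stats[name]


# A few lines copied from the --stats output in this thread.
sample = """\
info: PDF size is 124154212 bytes
info: stat drawing_objs = 31707 bytes (0.03%)
info: stat font_data_objs = 0 bytes (0.00%)
info: stat jpeg_image_objs = 124073372 bytes (99.93%)
info: stat nonjpeg_image_objs = 0 bytes (0.00%)
info: end of stats
"""

print(largest_category(parse_stats(sample)))
```

On the sample lines this reports jpeg_image_objs as the largest category, matching the reading above.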

To make the PDF smaller, it would be possible to downscale (resize) the JPEG images, and/or to recompress them with a lower quality setting. However, pdfsizeopt is unable to do so (one reason is that pdfsizeopt performs only visually lossless transformations by design), and it's unlikely that this feature will be introduced soon, unless someone volunteers to implement it.
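To get a rough idea of what lossy recompression with some other tool could achieve, the byte counts from the --stats output can be plugged into a back-of-the-envelope estimate. This is only a sketch: the ratios below are hypothetical, and the real result depends on the tool, the quality setting and the images themselves.

```python
# Byte counts from the --stats output above.
total_size = 124154212  # "PDF size is 124154212 bytes"
jpeg_size = 124073372   # "stat jpeg_image_objs = 124073372 bytes"


def estimated_size(jpeg_ratio):
    """Estimate the PDF size if the JPEG data shrank to jpeg_ratio of its size.

    Everything that is not JPEG data is assumed to stay unchanged.
    """
    other = total_size - jpeg_size
    return other + int(jpeg_size * jpeg_ratio)


# Hypothetical ratios, chosen only for illustration.
for ratio in (1.0, 0.5, 0.25):
    print("ratio %.2f -> about %.1f MB" % (ratio, estimated_size(ratio) / 1e6))
```

Since only 80840 bytes of this file are not JPEG data, the output size scales almost directly with how much the images shrink.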
