pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0
improve Flate compression with Zopfli, ECT and advzip #37

Open pts opened 6 years ago

pts commented 6 years ago

This doesn't apply to images embedded in the PDF, but to all calls to zlib.compress in main.py. Some of the Flate compressors are very slow, so we probably need some caching which persists across invocations of pdfsizeopt.
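A minimal sketch of what such a persistent cache could look like, written for Python 3. The function name cached_compress and the cache_dir and tag parameters are hypothetical and not part of pdfsizeopt; the idea is only to key each compressed result on a hash of the input bytes and keep it on disk between runs.

```python
import hashlib
import os
import tempfile
import zlib


def cached_compress(data, cache_dir,
                    compress=lambda d: zlib.compress(d, 9), tag="zlib9"):
    """Return compress(data), reusing a result stored on disk if present.

    Hypothetical helper: pdfsizeopt has no such function or cache directory.
    """
    # Key on the compressor tag plus the input bytes, so results produced
    # by different compressors never collide.
    key = hashlib.sha256(tag.encode("ascii") + b"\0" + data).hexdigest()
    path = os.path.join(cache_dir, key)
    try:
        with open(path, "rb") as f:
            return f.read()  # Cache hit: skip the slow compressor.
    except FileNotFoundError:
        pass  # Cache miss (or the cache directory does not exist yet).
    result = compress(data)
    os.makedirs(cache_dir, exist_ok=True)
    # Write to a temporary file and rename it into place, so a crash
    # cannot leave a truncated cache entry behind.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(result)
    os.replace(tmp_path, path)
    return result
```

With a cache like this, a slow compressor would run only once per distinct stream, both within a single run and across later runs on the same input.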

squinky86 commented 6 years ago

I tried using zopfli.zlib.compress(x) from the py-zopfli module instead of zlib.compress(x, 9) and indeed get an improvement in size (though it took longer for pdfsizeopt to run):

$ du -sh testpdf.*pdf
308K testpdf.pdf
112K testpdf.zlib.pdf
108K testpdf.zopfli.pdf

$ time pdfsizeopt-zopfli testpdf.pdf
...
info: generated 107524 bytes (34%)

real 0m19.144s
user 0m15.421s
sys 0m0.325s

$ time pdfsizeopt testpdf.pdf
...
info: generated 114113 bytes (37%)

real 0m2.038s
user 0m0.109s
sys 0m0.170s

Patch attached for reference. pdfsizeopt-zopfli.txt
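The substitution described in this comment could be sketched roughly as follows. This is not the attached patch; flate_compress and its use_zopfli switch are hypothetical names, and the only py-zopfli call used is the zopfli.zlib.compress(x) quoted above. The sketch falls back to zlib.compress(x, 9) when py-zopfli is not installed.

```python
import zlib

try:
    # py-zopfli; zopfli.zlib.compress is the call quoted in this comment.
    import zopfli.zlib as zopfli_zlib
except ImportError:
    zopfli_zlib = None


def flate_compress(data, use_zopfli=False):
    """Compress data into a zlib stream.

    use_zopfli is a hypothetical opt-in switch; pdfsizeopt has no such flag.
    """
    if use_zopfli and zopfli_zlib is not None:
        # Slower, but usually produces a smaller stream.
        return zopfli_zlib.compress(data)
    # Default path: standard zlib at maximum compression level.
    return zlib.compress(data, 9)
```

Either branch produces a zlib-format stream, so zlib.decompress can read the result regardless of which compressor wrote it.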

pts commented 6 years ago

Thank you for trying Zopfli compression, and thank you for publishing your measurements.

Could you please upload testpdf.pdf to make your measurements repeatable?

Zopfli will not be enabled by default in pdfsizeopt, because it's prohibitively slow. Also, pdfsizeopt sometimes decompresses and recompresses the same object 2 or 3 times unnecessarily. Until this is refactored and useful persistent caching is added, I'm not eager to add support for slow compressors. Please note that you can already use zopflipng with pdfsizeopt for image compression: pdfsizeopt --use-image-optimizer=zopflipng.

squinky86 commented 6 years ago

lipsum.zlib.pdf lipsum.zopfli.pdf lipsum.pdf

The test PDF I was using has some copyrighted information in it, so I made a new test file with LaTeX and have included it here. I also ran this on a much faster laptop (i7) under Cygwin.

Also, I agree - this should be an optional enhancement, not the default. Forcing users to take 5-100 times as long (depending on the size of their file) is too much of a performance hit to be the default, in my opinion.

lipsum.tex:

\documentclass{report}
\usepackage{lipsum}
\begin{document}
\chapter{Lipsum}
\lipsum[1-150]
\end{document}

$ time pdfsizeopt --do-require-image-optimizers=no lipsum.pdf
info: This is pdfsizeopt rUNKNOWN size=388706.
...
info: generated 82144 bytes (79%)

real 0m0.772s
user 0m0.353s
sys 0m0.463s

$ time pdfsizeopt-zopfli --do-require-image-optimizers=no lipsum.pdf
info: This is pdfsizeopt rUNKNOWN size=388706.
...
info: generated 78533 bytes (76%)

real 0m4.914s
user 0m3.762s
sys 0m0.432s
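To put the two sets of measurements in this thread side by side, the short calculation below derives the slowdown and the extra size reduction from the reported wall-clock ("real") times and output sizes. The numbers are copied from the runs quoted in this thread; the helper name summarize is only for illustration, and the two files were timed on different machines, so the ratios are indicative rather than directly comparable.

```python
# Figures copied from the timing runs quoted in this thread:
# (file, zlib seconds, Zopfli seconds, zlib bytes, Zopfli bytes)
runs = [
    ("testpdf.pdf", 2.038, 19.144, 114113, 107524),
    ("lipsum.pdf", 0.772, 4.914, 82144, 78533),
]


def summarize(zlib_s, zopfli_s, zlib_bytes, zopfli_bytes):
    """Return (slowdown factor, fractional size saving) of Zopfli over zlib."""
    slowdown = zopfli_s / zlib_s
    saving = (zlib_bytes - zopfli_bytes) / zlib_bytes
    return slowdown, saving


for name, *figures in runs:
    slowdown, saving = summarize(*figures)
    print(f"{name}: {slowdown:.1f}x slower, {100 * saving:.1f}% smaller output")
```

On these two files the Zopfli build takes roughly 6 to 9 times as long for an output about 4 to 6 percent smaller than the zlib one.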

squinky86 commented 6 years ago

New data point: My workplace is hesitant to install pngout since it is closed-source, and we can't verify the source with our software assurance teams. Instead, we're using pngwolf-zopfli: https://github.com/jibsen/pngwolf-zopfli

Using http://r0k.us/graphics/kodak/ as a test suite, pngout vs. pngwolf-zopfli produced the following:

Output of du -b *.png, regrouped one image per row (sizes in bytes; the columns are kodimNN.png, kodimNN.pngout.png and kodimNN.pngwolf.png):

image     original    pngout   pngwolf
kodim01     736501    694640    679555
kodim02     617995    609270    586211
kodim03     502888    502888    480385
kodim04     637432    622590    601154
kodim05     785610    760504    750614
kodim06     618959    604212    587599
kodim07     566322    558504    538315
kodim08     788470    743878    739509
kodim09     582899    563175    521646
kodim10     593463    569616    554627
kodim11     621023    600315    575097
kodim12     531024    531024    506176
kodim13     822712    804030    789194
kodim14     692201    692201    663963
kodim15     612582    597406    569791
kodim16     534247    533261    505048
kodim17     602078    598087    569016
kodim18     780947    754566    747900
kodim19     671476    637451    619245
kodim20     492462    476252    468370
kodim21     637051    636273    602484
kodim22     701970    690600    675837
kodim23     557596    552240    531785
kodim24     706397    678208    663763

pts commented 6 years ago

Thank you for the data points! It's nice to learn about pngwolf-zopfli.

Please note that PNG compression is off-topic in this issue (because it's about Flate compression for non-images). To continue the discussion on how PNG compression can be improved in pdfsizeopt, please open a separate issue.

Please note that compression speed also matters, so it would be nice to have speed and compression ratio numbers for pngout, zopflipng and pngwolf-zopfli. Please don't add them here, but in a separate issue.

Please note that pdfsizeopt already supports zopflipng with --use-image-optimizer=zopflipng, and it can also use arbitrary PNG optimizers by specifying a shell command template in the argument of --use-image-optimizer=. So you can use pngwolf-zopfli without any code change in pdfsizeopt. If there is any way you think support can be improved (e.g. --use-image-optimizer=pngwolf and --use-image-optimizer=pngwolf-zopfli), please open a new issue for that, and preferably also supply a patch.