pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0

add lossless optimizations for JPEG images embedded into the PDF #41

Open pts opened 7 years ago

pts commented 7 years ago
  1. EXIF tags etc. can be removed by Python code, see Removing JPEG metadata (e.g. comments, JFIF, Exif etc.) manually in info.txt for details; a minimal sketch follows this list. Don't use jpegtran -copy none, because it would keep some unnecessary metadata.

  2. Smaller Huffman tables can be generated. jpegtran -optimize can do this. mozjpeg's jpegtran doesn't create any smaller files (this is not true; double-check the size difference). This is equivalent to the lossless mode of jpegoptim and imgopt.

  3. This is only true for PDF 1.2 and earlier: jpgcrush and jpegrescan cannot be used, because they create progressive JPEG output, which PDF doesn't support. FYI If mozjpeg's jpegtran is used, it should be invoked with -revert, otherwise it enables -progressive by default.

  4. For research, try jpgcrush, jpegrescan and mozjpeg's jpegtran -optimize (all of these create progressive JPEG). Chances are that mozjpeg always produces the smallest output, so the other two don't have to be invoked by pdfsizeopt. Also upgrade the PDF version number to at least 1.3.
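
Point 1 above says the metadata removal can be done in plain Python. Here is a minimal sketch of that idea (Python 3) that walks the JPEG segment structure and drops APPn/COM segments; it is only an illustration, not pdfsizeopt's code, and the actual marker rules are the ones described in info.txt.

```python
#!/usr/bin/env python3
"""Sketch: drop APP0..APP15 and COM segments from a JPEG. Illustration only."""

import struct
import sys


def strip_jpeg_metadata(data):
    """Returns the JPEG bytes with all APPn and COM segments removed."""
    if data[:2] != b'\xff\xd8':
        raise ValueError('not a JPEG (missing SOI marker)')
    out = [b'\xff\xd8']
    i = 2
    while i + 2 <= len(data):
        if data[i] != 0xff:
            raise ValueError('expected a marker at offset %d' % i)
        while data[i + 1] == 0xff:  # Skip optional 0xff fill bytes.
            i += 1
        marker = data[i + 1]
        if marker == 0xd9:  # EOI: end of image.
            out.append(b'\xff\xd9')
            break
        if marker == 0xda:  # SOS: keep the entropy-coded data verbatim.
            out.append(data[i:])
            break
        size, = struct.unpack('>H', data[i + 2:i + 4])  # Includes the 2 length bytes.
        if not (0xe0 <= marker <= 0xef or marker == 0xfe):  # Keep non-APPn/COM segments.
            out.append(data[i:i + 2 + size])
        i += 2 + size
    return b''.join(out)


if __name__ == '__main__':
    with open(sys.argv[1], 'rb') as f:
        data = f.read()
    with open(sys.argv[2], 'wb') as f:
        f.write(strip_jpeg_metadata(data))
```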

rbrito commented 7 years ago

Hi, Péter.

On Sep 15 2017, Péter Szabó wrote:

  1. EXIF tags etc. can be removed by Python code, see Removing JPEG metadata (e.g. comments, JFIF, Exif etc.) manually in info.txt for details. Don't use jpegtran -copy none, because it would keep some unnecessary metadata.

Nice to know and highly useful for "anonymization" (or, at least, one step further towards it) of PDF files. Another option is to use jhead -purejpg, but using fewer dependencies is, of course, better.

  2. Smaller Huffman tables can be generated. jpegtran -optimize can do this. mozjpeg's jpegtran doesn't create any smaller files. This is equivalent to the lossless mode of jpegoptim and imgopt.

jpegtran is widely available on Linux distributions and I suspect that it can be made small even if you statically link it. I didn't know that mozjpeg's jpegtran didn't create smaller files with respect to Huffman tables. I assumed that it did.

  3. jpgcrush and jpegrescan cannot be used, because they create progressive JPEG output, which PDF doesn't support.

Didn't know that.

Anyway, optimizing the JPEG files is awesome for the future!

Thanks once again,

-- Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br


rbrito commented 7 years ago

Is this too hard to implement? It would rock to have this, since there are many files with embedded JPEGs that could use the kind of optimization jpegtran provides...

Regarding jpgcrush & co., I don't think that it would be a problem to upgrade PDFs to version 1.3... Almost every PDF out there has a version later than this...

pts commented 7 years ago

Is this too hard to implement?

No, it's relatively straightforward, but it needs time to implement, which I don't have immediately. (Funding could make a difference here.) The first step would be running jpgcrush, jpegrescan and mozjpeg's jpegtran -optimize on many JPEG files in PDFs, and figuring out if there is a clear winner (i.e. smallest output size). If there is a winner, it should be compiled and added to pdfsizeopt_libexec, and then the calling code should be added to pdfsizeopt. Removing the metadata is also not hard to implement. It's about 8 hours of work in total.
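
As a rough illustration of that first step, here is a hypothetical comparison harness; it is not pdfsizeopt code, the binary paths are assumptions, and only jpegtran-style command lines (which accept -copy none and -outfile) are wired up, since jpgcrush and jpegrescan have different calling conventions and would need their own wrappers.

```python
#!/usr/bin/env python3
"""Hypothetical harness: count which lossless JPEG optimizer wins most often
over a directory of JPEGs extracted from PDFs."""

import glob
import os
import subprocess
import sys
import tempfile
from collections import Counter

# Paths and flags are assumptions; point them at your local builds.
CANDIDATES = {
    'ijg-jpegtran': ['jpegtran', '-optimize', '-copy', 'none'],
    'mozjpeg-jpegtran': ['/opt/mozjpeg/bin/jpegtran', '-copy', 'none'],
}


def main(jpeg_dir):
    wins = Counter()
    for src in sorted(glob.glob(os.path.join(jpeg_dir, '*.jpg'))):
        sizes = {'original': os.path.getsize(src)}
        for name, cmd in CANDIDATES.items():
            fd, dst = tempfile.mkstemp(suffix='.jpg')
            os.close(fd)
            try:
                subprocess.check_call(cmd + ['-outfile', dst, src])
                sizes[name] = os.path.getsize(dst)
            finally:
                os.unlink(dst)
        wins[min(sizes, key=sizes.get)] += 1  # Tally the smallest result.
    for name, count in wins.most_common():
        print('%-18s smallest for %d file(s)' % (name, count))


if __name__ == '__main__':
    main(sys.argv[1])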

Regarding jpgcrush & co., I don't think that it would be a problem to upgrade PDFs to version 1.3...

Correct, pdfsizeopt could do this automatically. In some cases it already does.

rbubley commented 6 years ago

If you ever get a chance to look at this, another optimizer worth looking at is jpegoptim.

rbrito commented 4 years ago

Just for the record, I packaged jpgcrush for Debian-based systems (properly patched to be Debian-friendly with files not under /usr/local etc.).

With it installed, I can use a script of mine to call jpgcrush on some JPEG files (depending on the colorspace---I know how to handle RGB and some ICC-based JPEGs).

pts commented 1 year ago

FYI New version of the script is available at: https://github.com/rbrito/scripts/blob/master/optimize_pdfs.py

pts commented 1 year ago

FYI The design is the following.

For each image object with its last filter being /DCTDecode: uncompress everything except /DCTDecode.

For each image object with /Filter /DCTDecode:

  1. Parse the JPEG bitstream (i.e. the stream data of the PDF image object), find all markers until the JPEG image data itself.
  2. If the JFIF marker is missing, display a warning, give up, and keep the image object unchanged.
  3. If the JPEG dimensions and number of components don't match the image dimensions, display a warning, give up, and keep the image object unchanged.
  4. If the Adobe APP14 (APPE) marker is present (this is the one with /ColorTransform), save its contents for later use.
  5. Remove all APP* and COM markers (including JFIF). This removes all metadata (including Exif, XMP, IPTC and ICC).
  6. If a SOF1, SOF3 ... SOF15 marker (any) is detected, display a warning, give up, and keep the image object unchanged.
  7. If an unknown marker is detected (i.e. not on the whitelist), display a warning, give up, and keep the image object unchanged.
  8. Truncate the file after the EOI marker (2 bytes: '\xff\xd9').
  9. Add a JFIF marker without thumbnail (18 bytes). Add back the Adobe APP14 marker (if it was present).
  10. Save the result to source.jpg.
  11. Run a bunch of internal and external JPEG image optimizers on source.jpg, creating files opt1.jpg ... optN.jpg.
  12. Read each optK.jpg, remove all APP* and COM markers (including JFIF), add a JFIF marker without thumbnail (18 bytes), add back the Adobe APP14 marker (if it was present).
  13. Choose the smallest of source.jpg and all optK.jpg, replace the stream of the PDF image object with it.
  14. If /ColorTransform is specified in the image object, and it either has the default value, or the Adobe APP14 marker was also present, drop the /ColorTransform.
  15. Bump the PDF version to 1.3 if the chosen stream is a progressive JPEG (SOF2 marker) rather than baseline JPEG (SOF0 marker).
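
To make two of these steps concrete, here is a small illustrative sketch (not pdfsizeopt code) of step 9's 18-byte JFIF segment and step 15's baseline-vs-progressive check.

```python
import struct

# Step 9: an APP0/JFIF segment with no thumbnail is exactly 18 bytes:
# marker (2) + length 16 (2) + "JFIF\0" (5) + version 1.01 (2) +
# density units/x/y (1+2+2) + 0x0 thumbnail size (2).
JFIF_NO_THUMBNAIL = (
    b'\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00')
assert len(JFIF_NO_THUMBNAIL) == 18


def jpeg_is_progressive(data):
    """Step 15: True for progressive (SOF2), False for baseline (SOF0)."""
    i = 2  # Skip the SOI marker.
    while i + 4 <= len(data) and data[i] == 0xff:
        marker = data[i + 1]
        if marker == 0xc0:  # SOF0: baseline, no PDF version bump needed.
            return False
        if marker == 0xc2:  # SOF2: progressive, needs PDF >= 1.3.
            return True
        if marker == 0xda:  # Reached SOS without seeing SOF0/SOF2.
            break
        size, = struct.unpack('>H', data[i + 2:i + 4])
        i += 2 + size
    raise ValueError('no SOF0/SOF2 marker found')
```

In the flow above, jpeg_is_progressive(...) returning True on the chosen stream is what would trigger bumping the PDF version to 1.3.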

JPEG image optimizers included by default:

There will be a --use-jpeg-optimizer=... command-line flag (similar to --use-image-optimizer; maybe they will be merged) to specify which JPEG image optimizers to try. It will be possible to use other optimizers (as external programs) as well, in addition to those included.

Additional external JPEG image optimizers, not included by default, but can be used:

A tiny benchmark (all JPEG files with the JFIF marker, without the Adobe APP14 marker, 2560x1440, YCbCr4:2:2, Huffman coding):

zvezdochiot commented 1 year ago

@pts says:

jpegtran

jpegtran only makes sense if the task is to split one object into several smaller ones. To control the size/quality of an object, jpegoptim should be used.

pts commented 1 year ago

@zvezdochiot,

jpegtran only makes sense if the task is to split one object into several smaller ones.

This is incorrect. jpegtran of mozjpeg can do lossless optimizations on a JPEG file (in fact, it does that by default if no command-line flags are given, and also when -copy none is given). I have tried this and got some file size reductions.

To control the size/quality of an object, jpegoptim should be used.

It is a matter of opinion which program should be used for lossless JPEG image optimization (e.g. progressive scanline reordering, Huffman table optimization). jpegtran, jpgcrush and jpegoptim are all useful for this purpose. pdfsizeopt will have some defaults, but it will also be able to use as many JPEG optimizers as the user wants (and then pdfsizeopt will pick the smallest output).

pts commented 1 year ago

@zvezdochiot, I've extended my comment with some info about how jpegoptim can be used by pdfsizeopt in the future.

zvezdochiot commented 1 year ago

@pts says:

for lossless JPEG image optimization

jpegoptim is not only lossless. See the -m option. ;)

pts commented 1 year ago

jpegoptim is not only lossless. See the -m option. ;)

True, there is no confusion about this fact. My comment https://github.com/pts/pdfsizeopt/issues/41#issuecomment-1445066996 also indicates that jpegoptim is not only lossless.

This issue is about lossless JPEG image optimization only, so the lossy features of jpegoptim shouldn't be used here. See https://github.com/pts/pdfsizeopt/issues/95 for lossy JPEG image optimization in pdfsizeopt.

pts commented 1 year ago

@zvezdochiot, do you have some JPEG files to share for which jpegoptim (jpegoptim -s --all-progressive) generates very small output? I suspect that jpegtran of mozjpeg will generate smaller output, but we should try and compare.

zvezdochiot commented 1 year ago

@pts says:

do you have some JPEG files to share for which jpegoptim

No. I am now far from this topic. I'm working on automating PDF splitting into objects using csplit and replacing entire /DCT objects with /CCITT (using LibTIFF+gs) and /Flate (using sam2p+gs).

That is, I am busy with the topic of clean+lossy, not lossless.

Adreitz commented 1 year ago

Just a couple of other utilities to mention related to lossless JPEG optimization. jpegultrascan (https://github.com/MegaByte/jpegultrascan) is very slow because it is brute force, but it seems to produce the best results for me; I've used it calling the IJG and mozjpeg versions of jpegtran simultaneously. There is also pingo (https://css-ig.net/pingo.php), which is mostly a PNG optimizer but also does some limited, yet very fast, JPEG optimization. Pingo is a great bang-for-buck utility, but it is Windows-only and closed-source.

pts commented 1 year ago

@Adreitz, thank you for suggesting jpegultrascan and pingo.

According to my quick measurements, jpegultrascan is about 1450 times slower than jpegtran of mozjpeg for a ~100 KiB JPEG input file, and makes the output JPEG file 0.402% smaller on average than jpegtran of mozjpeg. Thus it won't be included with pdfsizeopt.
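
For anyone who wants to reproduce this kind of timing/size measurement, a rough sketch follows; the command lines for both tools are placeholders and assumptions, so adjust them for your local builds.

```python
#!/usr/bin/env python3
"""Rough timing/size comparison sketch; the command lines are placeholders."""

import os
import subprocess
import sys
import tempfile
import time

# '{src}' and '{dst}' are substituted per run; these CLIs are assumptions.
CANDIDATES = {
    'mozjpeg-jpegtran': ['jpegtran', '-copy', 'none', '-outfile', '{dst}', '{src}'],
    'jpegultrascan': ['jpegultrascan', '{src}', '{dst}'],
}


def measure(cmd_template, src):
    fd, dst = tempfile.mkstemp(suffix='.jpg')
    os.close(fd)
    try:
        cmd = [arg.format(src=src, dst=dst) for arg in cmd_template]
        start = time.perf_counter()
        subprocess.check_call(cmd)
        return time.perf_counter() - start, os.path.getsize(dst)
    finally:
        os.unlink(dst)


if __name__ == '__main__':
    for name, tpl in CANDIDATES.items():
        elapsed, size = measure(tpl, sys.argv[1])
        print('%-18s %9.2f s %9d bytes' % (name, elapsed, size))
```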

Because it's a Windows-only tool, pingo won't be included either. Also, pingo.exe is huge (almost 3 MiB) considering what it can do.

However, it will be possible to use them (and many others) with --use-jpeg-optimizer=....

pts commented 1 year ago

I was wondering which version of jpegtran to use. I've tried mozjpeg version 3.1 (2015-05-19) and version 4.1.1 (2022-09-15). The latter generated JPEG output files a few percent larger than the former. I've reported https://github.com/mozilla/mozjpeg/issues/433 to get advice.

pts commented 1 year ago

I've created a source port of jpegtran of mozjpeg on https://github.com/pts/pts-mozjpegtran. The executable programs used by pdfsizeopt will be built from these sources, cross-compiled on Linux.

GitHubRulesOK commented 1 year ago

Hmmm, I come from a long time ago, when media and handling were very expensive, so compression was an essential necessity for storage and transmission speeds. However, the purpose of a modern PDF is to be readable by many people as fast as possible, so storage size is no longer the constraint; in this modern century the bang per buck comes from decompression/rendering time. A few bytes saved by one author's exotic compression (such as the Internet Archive seems to use?) is magnified into a global-warming cost when each of thousands of readers spends extra seconds and CPU power decompressing for print or screen rendering. There should be a balancing point where simpler, faster decompression outweighs a few KB saved.

Adreitz commented 1 year ago

@GitHubRulesOK This is not "exotic compression". The compression formats relevant to PDFs (JBIG2, JPEG, GIF, PNG, ZIP) are all well-known and asymmetric. The decompression time is not affected in an appreciable way by the compression time used. While there is a time and energy cost associated with optimized compression, it can easily be made up by the use of the optimized file. It is quicker to download and therefore uses less energy through network activity. It could even be marginally quicker to open if stored on a fairly slow medium. And, though the storage space recovered from optimizing a single file may not be very meaningful, it adds up, and I don't think anyone would complain if they found their drive to suddenly have 10-30% greater capacity.

Also, the JPEG optimizations discussed in this issue are generally very quick, as opposed to many PNG optimizations.