pts opened this issue 7 years ago
Hi, Péter.
On Sep 15 2017, Péter Szabó wrote:
- EXIF tags etc. can be removed by Python code, see Removing JPEG metadata (e.g. comments, JFIF, Exif etc.) manually in info.txt for details. Don't use `jpegtran -copy none`, because it would keep some unnecessary metadata.
Nice to know and highly useful for "anonymization" (or, at least, one step further towards it) of PDF files. Another option is to use `jhead -purejpg`, but using fewer dependencies is, of course, better.
- Smaller Huffman tables can be generated. `jpegtran -optimize` can do this. mozjpeg's jpegtran doesn't create any smaller files. This is equivalent to the lossless mode of `jpegoptim` and `imgopt`.
jpegtran is widely available on Linux distributions and I suspect that it can be made small even if you statically link it. I didn't know that mozjpeg's jpegtran didn't create smaller files with respect to Huffman tables. I assumed that it did.
- jpgcrush and jpegrescan cannot be used, because they create progressive JPEG output, which PDF doesn't support.
Didn't know that.
Anyway, optimizing the JPEG files is awesome for the future!
Thanks once again,
-- Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br
Is this too hard to implement? It would rock to have this, since there are many files with embedded JPEGs that could use the kind of optimization that jpegtran provides...
Regarding jpgcrush & co., I don't think that it would be a problem to upgrade PDFs to version 1.3... Almost every PDF out there already has a version later than this...
Is this too hard to implement?
No, it's relatively straightforward, but it needs time to implement, which I don't have immediately. (Funding could make a difference here.) The first step would be running jpgcrush, jpegrescan and mozjpeg's `jpegtran -optimize` on many JPEG files in PDFs, and figuring out whether there is a clear winner (i.e. smallest output size). If there is a winner, it should be compiled and added to pdfsizeopt_libexec, and then the calling code should be added to pdfsizeopt. Removing the metadata is also not hard to implement. It's about 8 hours of work in total.
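As a rough illustration of that first step, here is a minimal sketch (not pdfsizeopt code) of running a few candidate optimizers on one JPEG and keeping the smallest valid output; the command names and flags (e.g. `mozjpeg-jpegtran`) are assumptions and would need adjusting to whatever binaries actually end up in pdfsizeopt_libexec.

```python
import subprocess

# Candidate lossless optimizers that read a JPEG from stdin and write the
# optimized JPEG to stdout. These command lines are assumptions.
OPTIMIZERS = {
    'jpegtran-ijg': ['jpegtran', '-optimize', '-copy', 'none'],
    'jpegtran-mozjpeg': ['mozjpeg-jpegtran', '-copy', 'none'],
}

def smallest_jpeg(jpeg_data):
    """Return (optimizer_name, data) for the smallest output; fall back
    to the unmodified input if nothing produces a smaller file."""
    best_name, best_data = 'original', jpeg_data
    for name, cmd in OPTIMIZERS.items():
        try:
            out = subprocess.run(cmd, input=jpeg_data,
                                 stdout=subprocess.PIPE, check=True).stdout
        except (OSError, subprocess.CalledProcessError):
            continue  # Optimizer not installed or failed; skip it.
        # Require a plausible JPEG (SOI marker) that is actually smaller.
        if out.startswith(b'\xff\xd8') and len(out) < len(best_data):
            best_name, best_data = name, out
    return best_name, best_data
```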
Regarding jpgcrush & co., I don't think that it would be a problem to upgrade PDFs to version 1.3...
Correct, pdfsizeopt could do this automatically. In some cases it already does.
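For the record, the header bump itself is trivial; a minimal sketch (not how pdfsizeopt actually does it) could look like this, ignoring the `/Version` key in the document catalog, which can override the header:

```python
import re

def ensure_min_pdf_version(pdf_data, min_version=b'1.3'):
    """Bump the %PDF-x.y header to at least min_version.

    Simplification: the /Version key in the document catalog (which may
    override the header version) is not handled here.
    """
    match = re.match(rb'%PDF-(\d\.\d)', pdf_data)
    if match and match.group(1) < min_version:
        return b'%PDF-' + min_version + pdf_data[match.end():]
    return pdf_data
```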
If you ever get a chance to look at this, another optimizer worth looking at is jpegoptim.
Just for the record, I packaged jpgcrush for Debian-based systems (properly patched to be Debian-friendly, with files not under /usr/local etc.). With it installed, I can use a script of mine to call jpgcrush on some JPEG files (depending on the colorspace---I know how to handle RGB and some ICC-based JPEGs).
FYI New version of the script is available at: https://github.com/rbrito/scripts/blob/master/optimize_pdfs.py
FYI The design is the following.

For each image object whose last filter is `/DCTDecode`: uncompress everything except for `/DCTDecode`.

For each image object with `/Filter /DCTDecode` (a sketch of these parsing steps follows the list):

- Parse the JPEG header (the `stream` data of the PDF image object), find all markers until the JPEG image data itself.
- If an Adobe APP14 marker is found (it corresponds to `/ColorTransform`), save its contents for later use.
- Find the end of the JPEG image data (the marker `'\xff\xd9'`).
- If `/ColorTransform` is specified in the image object, and it either has the default value, or the Adobe APP14 marker was also present, drop the `/ColorTransform`.
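A minimal sketch of those parsing steps (a simplification for illustration, not the actual pdfsizeopt code; the function name is mine):

```python
import struct

def parse_dctdecode_stream(data):
    """Scan the JPEG marker segments of a /DCTDecode stream.

    Returns (markers, adobe_transform, end), where markers is a list of
    (marker_byte, segment_payload) pairs up to SOS, adobe_transform is the
    color transform byte of an Adobe APP14 marker (or None), and end is
    the offset just past the EOI marker ('\xff\xd9').
    """
    assert data[:2] == b'\xff\xd8', 'not a JPEG (missing SOI)'
    markers, adobe_transform = [], None
    i = 2
    while i < len(data) and data[i] == 0xff and data[i + 1] != 0xda:
        marker = data[i + 1]
        (length,) = struct.unpack('>H', data[i + 2 : i + 4])
        payload = data[i + 4 : i + 2 + length]
        markers.append((marker, payload))
        if marker == 0xee and payload[:5] == b'Adobe':
            adobe_transform = payload[11]  # 0: none, 1: YCbCr, 2: YCCK.
        i += 2 + length
    # The entropy-coded image data starts at SOS; the JPEG ends with EOI.
    end = data.index(b'\xff\xd9', i) + 2
    return markers, adobe_transform, end
```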
JPEG image optimizers included by default:

- jpegtran of mozjpeg, with the `-copy none` flag. It optimizes much more by default than traditional jpegtran (whose behavior corresponds to `jpegtran -revert` in mozjpeg). Extra testing will be done to make sure it works with 4 components (e.g. `/Colorspace /DeviceCMYK`).
There will be a `--use-jpeg-optimizer=...` command-line flag (similar to `--use-image-optimizer`; maybe they will be merged) to specify which JPEG image optimizers to try. It will be possible to use other optimizers (as external programs) as well, in addition to those included.
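A minimal sketch of how such a flag could be parsed (the flag does not exist yet, so the flag name, default and built-in optimizer names below are assumptions):

```python
import argparse

# Hypothetical built-in optimizer names; the real set is to be decided.
BUILTIN_JPEG_OPTIMIZERS = ('jpegtran-mozjpeg',)

def parse_jpeg_optimizer_flag(argv):
    parser = argparse.ArgumentParser(prog='pdfsizeopt')
    parser.add_argument(
        '--use-jpeg-optimizer',
        default=','.join(BUILTIN_JPEG_OPTIMIZERS),
        help='comma-separated list of JPEG optimizers to try; names not '
             'known as built-ins would be treated as external programs')
    args = parser.parse_args(argv)
    return [s.strip() for s in args.use_jpeg_optimizer.split(',') if s.strip()]

# e.g. parse_jpeg_optimizer_flag(['--use-jpeg-optimizer=jpegtran-mozjpeg,jpegoptim'])
# -> ['jpegtran-mozjpeg', 'jpegoptim']
```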
Additional external JPEG image optimizers, not included by default, but which can be used:

- jpegoptim. Not all of its output options can be used, because `/DCTDecode` doesn't support them. It's currently not thoroughly tested whether jpegoptim in lossless mode can create much smaller files than jpegtran of mozjpeg.

A tiny benchmark (all JPEG files with the JFIF marker, without the Adobe APP14 marker, 2560x1440, YCbCr4:2:2, Huffman coding):

- `jpegoptim -s --all-progressive`: progressive, without metadata
- `jpegtran -copy none` with jpegtran of mozjpeg: progressive, without metadata

@pts say:
jpegtran

jpegtran only makes sense if the task is to split one object into several smaller ones. To control the size/quality of an object, jpegoptim should be used.
@zvezdochiot,
jpegtran only makes sense if the task is to split one object into several smaller ones.
This is incorrect. jpegtran of mozjpeg can do lossless optimizations on a JPEG file (in fact, it does that by default if no command-line flags are given, and also when `-copy none` is given). I have tried this and got some file size reductions.
To control the size/quality of an object, jpegoptim should be used.
It is a matter of opinion which program should be used for lossless JPEG image optimization (e.g. progressive scanline reordering, Huffman table optimization). jpegtran, jpgcrush and jpegoptim are all useful for this purpose. pdfsizeopt will have some defaults, but it will also be able to use as many JPEG optimizers as the user wants (and then pdfsizeopt will pick the smallest output).
@zvezdochiot, I've extended my comment with some info about how jpegoptim can be used by pdfsizeopt in the future.
jpegoptim is not only lossless. See the `-m` option. ;)
True, there is no confusion about this fact. My comment https://github.com/pts/pdfsizeopt/issues/41#issuecomment-1445066996 also indicates that jpegoptim is not only lossless.
This issue is about lossless JPEG image optimization only, so the lossy features of jpegoptim shouldn't be used here. See https://github.com/pts/pdfsizeopt/issues/95 for lossy JPEG image optimization in pdfsizeopt.
@zvezdochiot, do you have some JPEG files to share for which jpegoptim (`jpegoptim -s --all-progressive`) generates very small output? I suspect that jpegtran of mozjpeg will generate smaller output, but we should try and compare.
@pts say:

do you have some JPEG files to share for which jpegoptim

No. I am now far from this topic. I'm working on automating PDF splitting into objects using `csplit` and replacing entire `/DCT` objects with `/CCITT` (using `LibTIFF` + `gs`) and `/Flate` (using `sam2p` + `gs`).
That is, I am busy with the topic of clean+lossy, not lossless.
Just a couple other utilities to mention related to lossless JPEG optimization. It's very slow because it is brute force, but jpegultrascan (https://github.com/MegaByte/jpegultrascan) seems to produce the best results for me. I've used it simultaneously calling the IJG and mozjpeg versions of jpegtran. There is also pingo (https://css-ig.net/pingo.php), which is mostly a PNG optimizer but also does some limited, but very fast, JPEG optimization. Pingo is a great bang-for-buck utility, but it is Windows-only and closed-source.
@Adreitz, thank you for suggesting jpegultrascan and pingo.
According to my quick measurements, jpegultrascan is about 1450 times slower than jpegtran of mozjpeg for a ~100 KiB JPEG input file, and makes the output JPEG file 0.402% smaller on average than jpegtran of mozjpeg. Thus it won't be included with pdfsizeopt.
Because it's a Windows-only tool, pingo won't be included either. Also, pingo.exe is huge (almost 3 MiB) considering what it can do.
However, it will be possible to use them (and many others) with `--use-jpeg-optimizer=...`.
I was wondering which version of jpegtran to use. I've tried mozjpeg version 3.1 (2015-05-19) and version 4.1.1 (2022-09-15). The latter generated JPEG output files a few percent larger than the former. I've reported https://github.com/mozilla/mozjpeg/issues/433 to get advice.
I've created a source port of jpegtran of mozjpeg on https://github.com/pts/pts-mozjpegtran. The executable programs used by pdfsizeopt will be built from these sources, cross-compiled on Linux.
Hmmm, I come from a long time ago when the cost of media and handling was very high, so compression was an essential necessity for storage and transmission speed. However, the purpose of a modern PDF is to be readable by many as fast as possible, so storage size is no longer the constraint; the bang per buck in this modern century comes from decompression/rendering time. A few bytes saved by one author's exotic compression (such as the Internet Archive seems to use?) is magnified into a global-warming delay when each of thousands of readers takes extra seconds and CPU power to decompress for print or screen rendering. There should be a balancing point where simpler decompression time outweighs a few KB saved.
@GitHubRulesOK This is not "exotic compression". The compression formats relevant to PDFs (JBIG2, JPEG, GIF, PNG, ZIP) are all well-known and asymmetric. The decompression time is not affected in an appreciable way by the compression time used. While there is a time and energy cost associated with optimized compression, it can easily be made up by the use of the optimized file. It is quicker to download and therefore uses less energy through network activity. It could even be marginally quicker to open if stored on a fairly slow medium. And, though the storage space recovered from optimizing a single file may not be very meaningful, it adds up, and I don't think anyone would complain if they found their drive to suddenly have 10-30% greater capacity.
Also, the JPEG optimizations discussed in this issue are generally very quick, as opposed to many PNG optimizations.
- EXIF tags etc. can be removed by Python code, see Removing JPEG metadata (e.g. comments, JFIF, Exif etc.) manually in info.txt for details. Don't use `jpegtran -copy none`, because it would keep some unnecessary metadata. (A sketch of such a metadata-removal pass follows after this list.)
- Smaller Huffman tables can be generated. `jpegtran -optimize` can do this. mozjpeg's jpegtran doesn't create any smaller files (this is not true, double check the size difference). This is equivalent to the lossless mode of `jpegoptim` and `imgopt`.
- This is only true for PDF 1.2 and earlier: jpgcrush and jpegrescan cannot be used, because they create progressive JPEG output, which PDF doesn't support. FYI If mozjpeg's jpegtran is used, then it should be invoked with `-revert`, otherwise it enables `-progressive` by default.
- For research, try jpgcrush, jpegrescan and mozjpeg's `jpegtran -optimize` (all of these create progressive JPEG). Chances are that mozjpeg always produces the smallest output, so the other two don't have to be invoked by pdfsizeopt. Also upgrade the PDF version number to at least 1.3.
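As a rough sketch of the metadata removal mentioned in the first item (not the actual pdfsizeopt code or the recipe from info.txt), the idea is to keep only the markers needed for decoding and drop COM and all APPn segments except Adobe APP14:

```python
import struct

def strip_jpeg_metadata(data):
    """Return data with COM and APPn markers removed, keeping Adobe APP14
    (it affects the color transform) and everything from SOS onwards."""
    assert data[:2] == b'\xff\xd8', 'not a JPEG (missing SOI)'
    out = [b'\xff\xd8']
    i = 2
    while i < len(data):
        assert data[i] == 0xff, 'expected a marker'
        marker = data[i + 1]
        if marker == 0xda:  # SOS: copy the scan data and EOI verbatim.
            out.append(data[i:])
            break
        (length,) = struct.unpack('>H', data[i + 2 : i + 4])
        segment = data[i : i + 2 + length]
        is_com = marker == 0xfe
        is_appn = 0xe0 <= marker <= 0xef
        is_adobe_app14 = marker == 0xee and segment[4:9] == b'Adobe'
        if not (is_com or (is_appn and not is_adobe_app14)):
            out.append(segment)  # Keep DQT, DHT, SOF, Adobe APP14 etc.
        i += 2 + length
    return b''.join(out)
```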