voxmedia / image_compression_experiments


Consider comparing only one variable at a time #3

Open kornelski opened 9 years ago

kornelski commented 9 years ago

Image file sizes often grow exponentially(ish) with quality, so even barely noticeable differences in quality can give you wildly different file sizes and misleading results (more on that at https://pornel.net/faircomparison).
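To see this effect on your own data, here's a minimal sketch (not part of this repo) that re-encodes a photo you supply at a range of JPEG quality settings and prints the resulting sizes; Pillow is an assumption on my part:

```python
# Minimal sketch (not from this repo): re-encode one photo at a range of JPEG
# quality settings and print the resulting file sizes, to see how size grows
# with quality. Assumes Pillow is installed; pass the photo's path as argv[1].
import io
import sys

from PIL import Image

img = Image.open(sys.argv[1]).convert("RGB")

for quality in range(50, 100, 5):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    print(f"quality={quality:3d}  size={buf.tell():8d} bytes")
```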

When comparing codecs, the goal is usually to find the one with the best quality/filesize ratio. Even if you plan to convert images at a fixed quality setting, the aim is still to get the best "bang for the buck" (i.e. if the fixed quality setting turns out to be too high for an image, you still want to minimize the filesize cost of that).

However, when a test result shows a file that is bigger but also better quality, it's impossible to tell whether the quality/filesize ratio is better or worse, because that relationship is complex and non-linear.

The best way is to compare only one variable (keeping the other the same), and to ensure the results are in the same order of magnitude, so that one exceptionally large value doesn't decide the fate of the whole set. For example, imagine comparing savings on images in the KB range plus one image in the multi-MB range: no matter how good or bad the small images are, even small changes to the multi-MB one will outweigh them, overshadowing all the other results.

For example, if you're planning to convert images at a fixed quality setting, adjust that setting so that the sum of file sizes is the same for all codecs, and then compare quality.

Or convert images to the same, precisely measured quality, and then the file sizes will give you an indication of how much you can save on average.
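A rough sketch of the first approach (matching total file size, then comparing quality) might look like the following; the Pillow-based encoders, the glob pattern, and the byte budget are my own assumptions, not anything from this repo:

```python
# Sketch of "match total file size, then compare quality" (my interpretation,
# not the repo's script). For each codec, bisect the quality setting until the
# summed output size of the whole test set reaches a shared byte budget; at
# that point, comparing DSSIM scores compares quality alone. Pillow's JPEG and
# WEBP encoders stand in for the real codecs.
import glob
import io
import sys

from PIL import Image

def total_size(paths, fmt, quality):
    """Sum of encoded sizes for the whole test set at one quality setting."""
    total = 0
    for path in paths:
        buf = io.BytesIO()
        Image.open(path).convert("RGB").save(buf, format=fmt, quality=quality)
        total += buf.tell()
    return total

def quality_for_budget(paths, fmt, budget_bytes, lo=1, hi=95):
    """Bisect the lowest integer quality whose total size reaches the budget."""
    while lo < hi:
        mid = (lo + hi) // 2
        if total_size(paths, fmt, mid) < budget_bytes:
            lo = mid + 1
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    paths = glob.glob(sys.argv[1])   # e.g. "testset/*.png"
    budget = int(sys.argv[2])        # shared byte budget for every codec
    for fmt in ("JPEG", "WEBP"):
        q = quality_for_budget(paths, fmt, budget)
        print(fmt, "quality", q, "total bytes", total_size(paths, fmt, q))
```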

okor commented 9 years ago

So I'll elaborate just a tad on why I built this little script/app, to add some context.

We (Vox Media) have millions of images that we store and ultimately serve to users. The source images come into our system via any number of sources. They could be screenshots (png), photos from an image service like Getty (jpg or png), or something else.

A large majority of the images we serve on our sites pass through a self-hosted image service (https://github.com/thumbor/thumbor), which we use to resize and encode images. The way thumbor handles image compression settings is simple: a fixed "quality" integer applied to each format. For webp, for instance, we use 80, I believe.

So before I did a bunch of work to extend the same kind of compression benefits we see from webp to our Safari users (adding j2k to thumbor), I wanted to evaluate roughly (1) whether j2k would provide any compression savings at all and (2) whether we could retain visual similarity from source to encoded image within acceptable limits.

So this script applies a fixed quality setting to each encoder or optimizer. This accurately replicates what would happen if implemented via thumbor. Thumbor does not do adaptive compression to ensure visual similarity. It's "dumb".
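For reference, that fixed-quality setup could be expressed as a thumbor.conf along these lines (illustrative only; setting names and defaults may differ between thumbor versions, so check the docs for your release):

```python
# thumbor.conf sketch of the "dumb" fixed-quality setup described above.
# Illustrative only: setting names and defaults may vary between thumbor versions.
QUALITY = 80        # fixed quality index for generated JPEGs
WEBP_QUALITY = 80   # fixed quality index for generated webp images
AUTO_WEBP = True    # serve webp to browsers that advertise support for it
```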

So I used this test set to see what would happen to our actual images if we ran them through a fixed quality setting for each format. I did the test a bunch of times, tweaking the quality values for each encoder until the results seemed roughly sane. And then I ran a test over about 10 thousand images out of the total library of millions of images, to see what kind of images we would be serving to our users.

The demo server is a very small sample of only 30 images or so, to act as an example of what the UI would look like if you ran a larger test with a significant sample size. But the 10k test actually showed very similar results. It just created hundreds of gigabytes of images, so I only ran that test locally.

With the exception of mozjpeg, all source images are first encoded to png format at a 100 quality setting. Then the image is converted to the target format. For mozjpeg I used ppm at 100 quality.

After an image is converted to the target format (or just "optimized"), the image is converted back to a png. I use these two png sources when I compare using DSSIM and generate the diffs.
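In other words, the comparison step looks roughly like the following sketch (my approximation, not the repo's actual code); it assumes ImageMagick's `convert` and kornelski's `dssim` CLI are on the PATH:

```python
# Sketch of the comparison step described above (my reading of it, not the
# repo's actual code): decode both the source and the target encoding back to
# PNG, then score the pair with the dssim CLI. Assumes ImageMagick's `convert`
# and kornelski's `dssim` are on the PATH.
import subprocess
import sys

def to_png(src, dst):
    """Decode any input image to PNG with ImageMagick."""
    subprocess.run(["convert", src, dst], check=True)

def dssim_score(reference_png, candidate_png):
    """Run the dssim CLI; it prints the score followed by the file name."""
    out = subprocess.run(["dssim", reference_png, candidate_png],
                         check=True, capture_output=True, text=True)
    return float(out.stdout.split()[0])

if __name__ == "__main__":
    source, encoded = sys.argv[1], sys.argv[2]   # e.g. source.png target.jp2
    to_png(source, "/tmp/source.png")
    to_png(encoded, "/tmp/encoded.png")
    print("DSSIM:", dssim_score("/tmp/source.png", "/tmp/encoded.png"))
```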

File size is a direct comparison between the actual source image and the target encoding (j2k, webp, etc).

So what I was using DSSIM for was to tune the compression settings to the point where I could get the most consistent visual diff, mostly lower than 1.5%.

So I agree that I am not comparing apples to apples. But this is realistic data, because I will be applying a fixed quality setting to a very large set of images (just like most people will). I will also not be able to control the "quality" of the input images in the real world; it will vary drastically.

Am I crazy?

okor commented 9 years ago

I will preemptively add that if, after reading my response, you still think I should adjust my testing methodology to more accurately reflect what kind of real-world results I will be getting ... then I will gladly modify the process to use what you describe as the easier case: comparing at exactly the same file size.

I'm also open to a pull request if you'd prefer.

kornelski commented 9 years ago

It's very good that you're testing on a large, diverse sample.

It doesn't matter much that you intend to use a fixed quality setting, because what you're trying to measure is the quality/filesize ratio, not each variable individually (if you only cared about quality, you would serve PNG to everybody; if you only cared about file size, regardless of quality, you'd serve JPEG at the lowest quality setting). What you really care about is getting the best quality for the smallest file size.

I hear you saying that you're just going to set some fixed quality, and want to see if that makes files smaller. However, that doesn't test the quality of the codec well.

So the results of a test at an arbitrary, uncontrolled quality setting are mostly just a reflection of the setting you've chosen, and are not a good indication of the codec's performance.

Try the same codec at two significantly different settings, and you'll see that it can both win and lose that test at the same time!

okor commented 9 years ago

Thanks for the feedback!

It's possible I may have reinvented the wheel here. I'm curious if you know of an existing open source tool that implements the kind of quality/filesize ratio experiment I'm trying to conduct here ... but with an implementation closer to the spec you've described. Know of anything?

Ideally, we would have an image processing server that was a bit smarter, so this experiment would be less necessary. Rather than specifying a "quality", you would specify a DSSIM/SSIM tolerance. The goal would be: "make this image as small as possible while staying within this DSSIM tolerance". But to the best of my knowledge (please correct me if I am wrong), encoders don't have this kind of capability built in, so the parent application would need to use some type of iterative process for each image ... which would be very slow and not well suited for something that should generate a resized and encoded image quickly within an HTTP request cycle. Something akin to Tobias Baldauf's cjpeg-dssim.

I have, however, started dreaming up an application design that does just what's outlined here but gets around the speed issue: return a "safe" image quickly, but set the cache headers to immediately expire. Then, in the background, do all of the hard iterative work. Once the "best" image is determined, respond to all later requests with that "best" image along with long-lived cache headers. That way your CDN or Varnish or whatever can hold onto that image forever.
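A rough sketch of that per-image iterative step (the slow part that would run in the background) might look like this; the Pillow encoder, the dssim CLI call, and the 0.015 tolerance (the "1.5%" mentioned earlier) are all assumptions on my part:

```python
# Rough sketch of the per-image "smallest file within a DSSIM tolerance" step
# described above, i.e. the slow, iterative part that would run in the
# background. The Pillow encoder, the dssim CLI call, and the 0.015 tolerance
# are assumptions, not code from this repo.
import subprocess
import tempfile

from PIL import Image

def encode(src_png, quality, out_path, fmt="JPEG"):
    """Encode the source at a given quality setting."""
    Image.open(src_png).convert("RGB").save(out_path, format=fmt, quality=quality)

def dssim(reference_png, candidate_path):
    """Decode the candidate back to PNG (mirroring the round-trip above) and score it."""
    with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
        Image.open(candidate_path).save(tmp.name, format="PNG")
        out = subprocess.run(["dssim", reference_png, tmp.name],
                             check=True, capture_output=True, text=True)
        return float(out.stdout.split()[0])

def smallest_within_tolerance(src_png, out_path, tolerance=0.015, lo=1, hi=95):
    """Binary-search the lowest quality whose DSSIM stays within the tolerance."""
    while lo < hi:
        mid = (lo + hi) // 2
        encode(src_png, mid, out_path)
        if dssim(src_png, out_path) > tolerance:
            lo = mid + 1    # too much visual difference, raise quality
        else:
            hi = mid        # acceptable, try a lower quality
    encode(src_png, lo, out_path)
    return lo
```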

kornelski commented 9 years ago

Are you asking for a solution to conduct an experiment, or for something to use in production?

If it's for the purpose of the experiment, then even a bash script that does bisection is OK (you can easily adapt cjpeg-dssim).

For production, imgmin is more efficient, since all compression and testing is done in memory without temp files. I think jpeg-archive also does something like this, but I haven't used it.

okor commented 9 years ago

I was asking for a solution for the experiment. Your hint that a bisection strategy would suffice sounds good to me.

I think my mind is in a good place right now. Thanks for all your help!