imazen / imageflow

High-performance image manipulation for web servers. Includes imageflow_server, imageflow_tool, and libimageflow
https://docs.imageflow.io/
GNU Affero General Public License v3.0

Set up benchmark with MagicScaler #71

Closed lilith closed 8 years ago

lilith commented 8 years ago

Design a way to compare MagicScaler and Imageflow performance and image quality

Here's a copy of the test harness that produced all my numbers and screenshots: https://github.com/saucecontrol/ScalerVisualTest.

I'd love to see Imageflow in there. I grabbed your prototype but was unable to do any meaningful comparison since it only does one operation per process instance. I can't compare MagicScaler on those terms because I'm stuck with a ~100ms JIT penalty on my first operation per process.

We'd definitely have to wait for C# FFI bindings to make that harness work as-is with Imageflow; could be a while.

What if I add a benchmark loop to the prototype so it can process the same image X many times in a single invocation? Would that let you do a meaningful benchmark?

saucecontrol commented 8 years ago

I'm not sure how to address that, really. I considered trying to establish a baseline value for the process-launch overhead in Imageflow, but then you have to figure in one-off costs like LUT initialization. Same thing on the MagicScaler side, except worse -- since different options or input images can trigger different code paths that have to be JITted.

For serial speed evaluation, an inner loop and a timings dump would be helpful for sure. For parallel, I don't think we'll be able to compare fairly until they can be loaded in the same process.

I don't know if you've tried to run my benchmark harness yet, but until I get a nuget package published, you'll have to plug in MagicScaler from here. My beta builds include my reference GDI+ and WIC resizers as well.

lilith commented 8 years ago

Are you looking at end-to-end perf under specific conditions, or do you want to benchmark scaling exclusively? Given that we're using different codecs, the end-to-end perf is going to vary a lot. Only a small percentage of runtime is spent outside the codec.

saucecontrol commented 8 years ago

As far as I'm concerned, end-to-end is the only meaningful comparison. The codec differences are definitely going to complicate things, but if we can establish a set of compatible JPEG encoder settings, I think we can make it fair.

With my pipeline there's no way to separate performance anyway. It pulls one target MCU-line through at a time. I'm not doing anything at all with compression-optimized outputs (e.g. progressive scan or optimized Huffman tables). I tend to believe the first request for an image should follow the absolute fastest path possible to the end user, and optimization (such as jpegtran) should be done on the cached copy of the image offline for the benefit of future requestors.

I did notice, though, that the Imageflow prototype outputs progressive JPEG. Is that something you planned on making configurable?

lilith commented 8 years ago

All JPEG encoders derive their quantization tables in different ways; duplicating even that is usually impossible through the controls exposed.

We're going to make jpeg encoding highly configurable. If funding permits, we'll make perceptual adaptive encoding happen, since the quality number is so spectacularly useless in reality. Imgmin and jpeg-archive do this now, but take several seconds per file.
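Roughly, that imgmin-style loop re-encodes at successively lower quality until a perceptual metric crosses a threshold. A minimal sketch, assuming Pillow, with mean absolute error as a crude stand-in for a real perceptual metric like DSSIM (`adaptive_quality` is a hypothetical name, not any tool's API):

```python
import io
import numpy as np
from PIL import Image

def perceptual_error(a: Image.Image, b: Image.Image) -> float:
    # Crude stand-in for DSSIM: mean absolute channel error on a 0..1 scale.
    return float(np.mean(np.abs(
        np.asarray(a, np.float32) - np.asarray(b, np.float32)))) / 255.0

def adaptive_quality(img: Image.Image, threshold: float = 0.01) -> int:
    # Walk the quality setting down until the error threshold is crossed.
    # (imgmin binary-searches instead; a linear walk keeps the sketch short.)
    best = 95
    for q in range(95, 40, -5):
        buf = io.BytesIO()
        img.save(buf, "JPEG", quality=q)   # assumes an RGB input image
        buf.seek(0)
        if perceptual_error(img, Image.open(buf).convert(img.mode)) > threshold:
            break
        best = q
    return best
```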

Maybe we should benchmark bitmap->bitmap processing if we want to compare scaling performance?

saucecontrol commented 8 years ago

Yeah, bitmap to bitmap would definitely be great for comparing scaler performance. If that's something you think would be a useful addition to Imageflow, I'm all for it.

Of course we can't match encoder work exactly, but I don't think that's a deal-breaker. We have different architectures and different pipelines overall. And of course my project isn't cross-platform, so we can only see how things compare on Windows anyway.

For me, it's just about seeing whether the overall task can be done any faster or more efficiently. If you can beat me at a given task, as long as the definition of that task is close enough to be comparable, I think that's worth knowing. I'd imagine it's the same for you. ImageMagick is far too easy a target to beat performance-wise, so I'm just interested in upping the level of competition (again, in a friendly way)

lilith commented 8 years ago

> ImageMagick is far too easy a target to beat performance-wise, so I'm just interested in upping the level of competition (again, in a friendly way)

Agreed, although they're adding GPU acceleration to more components quite regularly. Libvips is currently a more competitive target; have you benchmarked against it?

I'm up for adding a bitmap codec. What would the harness look like? Could the processes self-benchmark and exclude warmup time to help compensate for .NET JIT delay?

saucecontrol commented 8 years ago

Given that I'm targeting the .NET Framework, I haven't looked at benchmarking against anything that wouldn't be easily accessible there, so libvips is right out. Originally that left me with only GDI+, Paint.NET, FreeImage, and the like. I was actually excited when I found FastScaling as it seemed to be the only thing attempting to address performance and correctness in the .NET universe.

I think your harness description sounds completely reasonable. I can throw together a CLI for MagicScaler quite quickly that matches the operations of the Imageflow prototype. Maybe have the normal output be used for a warmup and add a switch that just repeats that same operation n more times writing to memory instead of disk?

lilith commented 8 years ago

Sounds good. libvips has 'vipsthumbnail' as a standalone binary. Startup overhead is 15ms for it on Ubuntu 14.04, but I don't have a Windows workstation to see what the Windows overhead is.

I'll need to add memory-based I/O to our afternoon prototype now :)

So, runtime benchmark:

  1. Do operation with actual file I/O, produce result. (as warmup)
  2. Load file into memory.
  3. Run X executions on N threads, wait for all to complete. Report timings to STDOUT.
  4. Default parameters: linear scaling w/ Mitchell filter, perhaps?
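In Python-ish terms, that protocol might look like the sketch below, where `resize` is a hypothetical callable wrapping whichever tool is under test (not any real harness's API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(resize, path: str, executions: int = 100, threads: int = 4):
    # 1. Warmup with real file I/O, so JIT/LUT one-off costs fall outside timing.
    with open(path, "rb") as f:
        src = f.read()
    resize(src)

    # 2. Source is now in memory; 3. run X executions on N threads.
    timings = []
    def one_pass(_):
        t0 = time.perf_counter()
        resize(src)                  # output to memory, not disk
        timings.append(time.perf_counter() - t0)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(one_pass, range(executions)))

    # 4. Report timings to STDOUT.
    timings.sort()
    print(f"n={executions} min={timings[0]*1e3:.2f}ms "
          f"median={timings[len(timings)//2]*1e3:.2f}ms max={timings[-1]*1e3:.2f}ms")
```
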
saucecontrol commented 8 years ago

I like it.

The only issue I can see there is that MagicScaler doesn't do intermediate scaling in linear light since it uses the decoder or the WIC scaler (or both) for the intermediate step. The only way to get 100% linear scaling in the current build is to disable hybrid mode.

Mitchell is awfully blurry for comparing visual results, especially if intermediate averaging has been done. If your default is a 2-lobe Lanczos, that sounds better to me. Is your Lanczos implementation standard?

lilith commented 8 years ago

To mimic our default, use ImageWorsener with `-filter cubic0.37821575509399867,0.31089212245300067 -blur 0.85574108326`. The blur value puts the zero crossings at -1, 1, like Lanczos, but is a slightly different shape.

Lanczos implementations vary by windowing function, so it's actually hard to mimic those perfectly unless we nail down the math first. Is there a bicubic-family filter you like?
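Any bicubic-family filter, including the default quoted above (and Catmull-Rom, which is B=0, C=0.5), is an instance of the standard Mitchell-Netravali BC-spline. A sketch of the weight function, assuming ImageWorsener-style blur is applied by stretching the sample coordinate:

```python
# Standard Mitchell-Netravali BC-spline; the B/C/blur values in
# imageflow_default() are the ones quoted above.
def bc_cubic(x: float, b: float, c: float) -> float:
    x = abs(x)
    if x < 1.0:
        return ((12 - 9*b - 6*c) * x**3
                + (-18 + 12*b + 6*c) * x**2
                + (6 - 2*b)) / 6.0
    if x < 2.0:
        return ((-b - 6*c) * x**3
                + (6*b + 30*c) * x**2
                + (-12*b - 48*c) * x
                + (8*b + 24*c)) / 6.0
    return 0.0

def imageflow_default(x: float) -> float:
    b, c, blur = 0.37821575509399867, 0.31089212245300067, 0.85574108326
    return bc_cubic(x / blur, b, c)   # blur < 1 narrows the kernel,
                                      # pulling the zero crossings in to ±1
```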

Intermediate averaging is a big problem. FastScaling converts to linear before using intermediate scaling, although linear conversion destroys 70% of the speedup. This happens outside of the codec, and only when the input/output are multiples of the block size (so it rarely shows up in benchmarks).

Intermediate scaling in a jpeg codec just drops frequency domain data, so it's essentially a worse version of spatial, gamma-incorrect averaging. It's crazy fast to throw away data, though!

We forked the jpeg codec to integrate a spatial, linear, lanczos-like filter into the block downscaling. Instead of being a 3x-5x speedup it's more like a 1.5x, but the quality is good, although additional blurring at block edges would be ideal if you're not post-scaling by a large ratio. I'm going to run ResampleScope on the output when I get a chance and see if it's possible to tweak further.

It's unfortunately no good to resample in linear if you do intermediate scaling in sRGB, because the lion's share of data loss has already occurred by averaging numerically compressed values.
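A toy example of that data loss, using the standard sRGB transfer functions: averaging a white highlight into a dark background in sRGB lands well below the physically correct result.

```python
# Mixing pure white with dark gray in sRGB vs. in linear light.
def srgb_to_linear(c: float) -> float:
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(l: float) -> float:
    return 12.92 * l if l <= 0.0031308 else 1.055 * l ** (1 / 2.4) - 0.055

white, dark = 1.0, 0.2
naive   = (white + dark) / 2                     # averaged in sRGB
correct = linear_to_srgb((srgb_to_linear(white) + srgb_to_linear(dark)) / 2)
print(f"sRGB average: {naive:.3f}, linear-light average: {correct:.3f}")
# -> roughly 0.600 vs 0.746: the naive result visibly dims the highlight
```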

saucecontrol commented 8 years ago

I can replicate that filter in MagicScaler easily, but I'd prefer a full-width cubic if Imageflow can do that. Catmull-Rom (b=0, c=.5, blur = 1) would be my preference.

I'm well aware of the quality cost of intermediate scaling, which is why I spent a good deal of effort making sure my scaler was performant on its slow path. If you really want quality, I think you do the whole thing in linear light with high-quality scaling. In fact, most users are using MagicScaler that way exclusively. It would be possible to add an extra hybrid mode to MagicScaler to force linear for the whole thing while still allowing averaging, but it has no such implementation now, and I'm not sure those requirements aren't at odds with one another.

I disagree that there is no value in using hybrid linear light with hybrid scaling. As long as there is enough work left to do in the final step, there can still be benefit to doing it linear even if some data was lost in the intermediate step. It's highly image-dependent, of course, but for any potential quality advantage or issue, there's always an image somewhere that shows it even if most don't.

lilith commented 8 years ago

Catmull-Rom it is.

> I disagree that there is no value in using hybrid linear light with hybrid scaling. As long as there is enough work left to do in the final step, there can still be benefit to doing it linear even if some data was lost in the intermediate step. It's highly image-dependent, of course, but for any potential quality advantage or issue, there's always an image somewhere that shows it even if most don't.

Given that the most loss occurs at small high-contrast points (small white lines, highlights, reflections), how does that work? I don't have a harness set up to measure this, but, subjectively, how much hybrid scaling can MagicScaler do on, say, this image before visible perceptual loss occurs when compared to pure linear resampling?

saucecontrol commented 8 years ago

Well, that's a tricky question, because visible perceptual loss is a given. The more appropriate question is whether the results are satisfactory or whether they are an improvement over gamma-compressed scaling. I'd say in this case there was marked improvement. The streams in the fireworks are more defined and more of the lights on the horizon are visible.

[image: fireworkshybridlinear]

In fact, this is one of those rare images where the 'correct' output may appear to some people to be overly brightened and the hybrid result may be more appealing.

In this case, MagicScaler had a 2:1 scale in the decoder and the remaining 3.9:1 done with the high-quality scaler in linear light. The FastScaling example did whatever FastScaling does with this image when you set down.colorspace=linear.

lilith commented 8 years ago

> In fact, this is one of those rare images where the 'correct' output may appear to some people to be overly brightened and the hybrid result may be more appealing.

And very broken scaling filters sometimes produce pretty output, on certain images, at certain scaling ratios. I think that linear light scaling is less arguable than which scaling filter to use.

Why not diff the 'pure' results and the hybrids to see how far off they are? Use `&down.speed=-2` to disable FastScaling's pre-scaling if you like. We could say that more than a 1% difference is problematic, perhaps? I use `compare a.png b.png -fuzz 1% diff.png` in ImageMagick for this.
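A rough numpy equivalent of that check might be (a sketch only; ImageMagick's `-fuzz` uses a color distance, so a per-channel threshold merely approximates it):

```python
import numpy as np
from PIL import Image

def fuzz_diff(path_a: str, path_b: str, fuzz: float = 0.01) -> float:
    # Fraction of pixels with any channel differing by more than `fuzz`.
    a = np.asarray(Image.open(path_a).convert("RGB"), np.float32) / 255.0
    b = np.asarray(Image.open(path_b).convert("RGB"), np.float32) / 255.0
    bad = np.any(np.abs(a - b) > fuzz, axis=-1)
    return float(bad.mean())

# e.g. flag the hybrid result as problematic if > 1% of pixels differ:
# fuzz_diff("pure.png", "hybrid.png") > 0.01
```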

saucecontrol commented 8 years ago

> Why not diff the 'pure' results and the hybrids to see how far off they are?

Because neither implementation supports separating the two. FastScaling can't do hybrid linear, and MagicScaler can't do hybrid linear without hybrid scaling. You can't measure whether one is more important than the other if you can't separate the two. And which is more important will vary from image to image.

> I think that linear light scaling is less arguable than which scaling filter to use.

I think we'll have to agree to disagree on that point. You're going to implement the features you think are important, and I'm going to implement the features I think are important. If they happen to overlap, we have an opportunity for benchmarking. I think we've both implemented a max-correctness path and a max-speed path. Those should be a good start for comparison. Any time you get into hybrids, it becomes more a matter of which sacrifices are acceptable to whom, and the evaluation of the results is much more subjective.

lilith commented 8 years ago

> Because neither implementation supports separating the two. FastScaling can't do hybrid linear, and MagicScaler can't do hybrid linear without hybrid scaling. You can't measure whether one is more important than the other if you can't separate the two. And which is more important will vary from image to image.

I'm suggesting you diff MagicScaler hybrid to MagicScaler linear results. FastScaling doesn't need to come into play for this.

> Any time you get into hybrids, it becomes more a matter of which sacrifices are acceptable to whom, and the evaluation of the results is much more subjective.

I mean, given a baseline result from, say, ImageWorsener that we both consider 'correct', one can determine difference through DSSIM values and direct diffing, right? Does this actually have to be subjective?

saucecontrol commented 8 years ago

> I'm suggesting you diff MagicScaler hybrid to MagicScaler linear results

But I can't do full-linear while doing hybrid scaling. And I can't do hybrid linear without hybrid scaling. The only thing I can compare is whether hybrid scaling with hybrid linear is better than hybrid scaling without. It is.

You're asking me to change two variables between tests, compare the results, and tell you how important one of the variables was. Can't be done.

> I mean, given a baseline result from, say, ImageWorsener that we both consider 'correct', one can determine difference through DSSIM values and direct diffing, right?

Theoretically, yes. But since our hybrid implementations differ, we'd be testing the accuracy of the hybrids, not the performance. We can only test the performance meaningfully if we're doing the same work.

lilith commented 8 years ago

> But I can't do full-linear while doing hybrid scaling. And I can't do hybrid linear without hybrid scaling. The only thing I can compare is whether hybrid scaling with hybrid linear is better than hybrid scaling without. It is.

We've overloaded the term 'hybrid' so much that I no longer follow. Let's give the scaling approaches numbers. We still have to specify which color channels, bit widths, and filters are being used, but maybe it will help?

  1. Decoder scaling (gamma-incorrect, frequency domain) - usually YCbCr
  2. Decoder scaling (gamma-incorrect, spatial domain, any filter) - usually CbCr
  3. Decoder scaling (gamma-correct, spatial domain, any filter) - usually Y
  4. Block scaling (gamma-incorrect, pixel mixing) - usually premultiplied sRGB
  5. Block scaling (gamma-incorrect, any filter) - usually premultiplied sRGB
  6. Block scaling (gamma-correct, pixel mixing) - usually premultiplied linear RGB(A)
  7. Block scaling (gamma-correct, any filter) - usually premultiplied linear RGB(A)
  8. Resampling (gamma-incorrect, any filter) - usually premultiplied sRGB
  9. Resampling (gamma-correct, any filter) - usually premultiplied linear RGB(A)

1 through 7 suffer from incorrectly weighted edge pixels unless the image is a perfect multiple of 8 or the block size. Without post-processing, they also have unexpectedly sharp edges near block boundaries, since pixels from different blocks do not mix.

FastScaling employs methods 4 + 8 or 6 + 9. It doesn't have access to anything other than 24- or 32-bit BGR(A) input/output, so it can't be smart about, well, much at all.

Imageflow uses 1-3 and 8-9. By default, you can only enable 8, 8+1, 9, or 9+3. How would you describe what MagicScaler does, so that we can make sure we're talking about the same work?

> But since our hybrid implementations differ, we'd be testing the accuracy of the hybrids, not the performance

From my perspective, you start by evaluating accuracy. If an approach is accurate enough, then you optimize it.

saucecontrol commented 8 years ago

Ah, that's helpful. I see where we're disconnected now.

Currently, MagicScaler does 8, 8+1, 9, or 9+1. What I'm understanding is that you're asking me to compare 9+1 with 9.

The problem with that is that 1 degrades the image quality in two different ways. It destroys high-frequency detail because that's what it does, and it destroys highlights/contrast because it's working with gamma-compressed data.

I can compare 8+1 with 9+1 because 8 only degrades the image in one way. And I can compare 8 with 8+1 because they both degrade the image in the same way, and then 1 does it in an additional way. Only one variable in each of those.

Comparing 9 with 9+1 will tell you the image is worse, but it won't tell you which of the two destructive factors from 1 caused the loss.

If I understand right, your position seems to be that one of those degradation factors is an acceptable loss and the other isn't. We can't prove that conclusively with that test. And even if we had the right supported modes to limit the test to a single variable, the results would still be image-dependent. When we sacrifice speed for quality, we assume that the loss we incur will be acceptable for the image in question. There are always exceptions to that. For that reason, if quality is your primary concern, the slow path should be the only right path. That's a path we both have implemented, so it seems like a good place to benchmark.

We also seem to have agreed that a no-image-is-sacred-just-do-it-fast path is also worth building. And that seems like a fair place to compare as well, provided our implementations are sufficiently similar.

We've disagreed on the middle ground, and I'm not sure there's a resolution to that in sight.

Actually scratch that... if I look at what we have in common, we both share 8, 9, and 8+1, so those are all valid tests. You have 9+3 where I have 9+1, and those obviously would have performance and quality differences, so that's not a valid comparison. I may still add 3 to my implementation at some point, but it's just not an option right now. And it seems you're quite opposed to 9+1, despite my having shown nice results with it on an image you selected. That's fine. There's no reason for you to implement something you don't believe in just to benchmark it.

lilith commented 8 years ago

> For that reason, if quality is your primary concern, the slow path should be the only right path. That's a path we both have implemented, so it seems like a good place to benchmark.

Yes, the highest quality path is the best one for us to benchmark. I do spend a lot of time searching for faster paths that are within a rounding error of the highest quality, but those are hard to find; 90% of my experiments fail and are discarded.

> Comparing 9 with 9+1 will tell you the image is worse, but it won't tell you which of the two destructive factors from 1 caused the loss.

True; I didn't realize you were using 1. I'd assumed you had a custom intermediate scaling algorithm based on your earlier statements and could change them independently.

> If I understand right, your position seems to be that one of those degradation factors is an acceptable loss and the other isn't. We can't prove that conclusively with that test.

Strictly speaking, there are three degradation factors: the block-based scaling defects, the non-linear math, and the frequency domain loss. I'd assumed your code could separate out the third and vary the other two independently.

> the results would still be image-dependent

Time for yardstick.pictures and dynamic Azure/AWS instances? We can independently control variables with some software.

> worth building.

Much is worth building. I can't say I'll commit to retaining/maintaining anything that causes visible quality loss for my own use case; I tend to discard the algorithms that can't be made to reach a certain DSSIM of similarity (preferably a fixed percentage of off-by-one errors).

> And it seems you're quite opposed to 9+1, despite my having shown nice results with it on an image you selected. That's fine. There's no reason for you to implement something you don't believe in just to benchmark it.

So, here's my problem with mixing linear and sRGB scaling algorithms: it means you can see visible color shifts depending upon the output scaling size. It's wacky to see an image change color just because of how it changes size. Color shifts due to sRGB scaling perpetuate themselves well through the other layers.

On the other hand, mixing different scaling filters or sampling methods - but staying in linear space - has a greater falloff factor. You may see uneven sampling, but that diminishes as the low-quality to high-quality scaling ratio changes, and it's consistent regardless of output size. Gamma-incorrect scaling (color) artefacts, by contrast, can be triggered by ratio, but persist regardless of the high-quality ratio afterwards.

Does this make sense? One can have bounded/consistent error, the other cannot.

A version of this image with additional various scalings of the patterns could be helpful in reproducing this aspect: https://s3-us-west-2.amazonaws.com/imageflow-resources/test_inputs/gamma_test.jpg

For reproducing the other aspect, rings work great: https://s3-us-west-2.amazonaws.com/imageflow-resources/test_inputs/rings2.png

And for verifying premultiplication correctness, I use https://s3-us-west-2.amazonaws.com/imageflow-resources/test_inputs/premult_test.png

saucecontrol commented 8 years ago

> I didn't realize you were using 1. I'd assumed you had a custom intermediate scaling algorithm based on your earlier statements and could change them independently.

Yep, that's why it's all or nothing for me. Enabling hybrid mode enables DCT scaling, so you can only control the remainder ratio and how the remainder is processed at that point.

In cases where DCT scaling isn't available, I use WIC's Fant scaler (which works out to a very slightly incorrect box filter) as an alternate. I treat them as equivalent in the hybrid logic in that they're both fast but create blurring and aliasing. I feed the gamma-compressed data from the decoder to the WIC scaler for intermediate processing and then pick up the high-quality phase from there. That way the results are similar whether DCT scaling is available or not.

Of course, that's all subject to change, but that's how it works in the current builds.
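For illustration only (this is not MagicScaler's actual code): the general shape of that hybrid planning, assuming libjpeg-style 1/2, 1/4, 1/8 DCT scaling, is to take the deepest cheap scale that stays at or above the target and leave the remainder ratio for the high-quality scaler.

```python
def plan_hybrid(src: int, dst: int) -> tuple[int, float]:
    # Pick the deepest DCT scale (1/2, 1/4, 1/8) that never drops the
    # intermediate image below the target size.
    denom = 1
    for candidate in (2, 4, 8):
        if src // candidate >= dst:
            denom = candidate
    intermediate = src // denom
    return denom, intermediate / dst  # (DCT denominator, remainder ratio)

# e.g. 4000px -> 400px: DCT takes 1/8 to 500px, HQ scaler does the last 1.25x
print(plan_hybrid(4000, 400))  # -> (8, 1.25)
```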

> So, here's my problem with mixing linear and sRGB scaling algorithms: it means you can see visible color shifts depending upon the output scaling size.

That's a good point. It's rare for that to happen noticeably, but I totally get your reasoning.

> You may see uneven sampling, but that diminishes as the low-quality to high-quality scaling ratio changes, and it's consistent regardless of output size.

There are some images that will show moiré patterns that come and go with resize ratio when a low-quality scaler is involved as well.

This image just happens to exhibit both behaviors. It's the image that got me started writing MagicScaler.

I maintain that both techniques run the risk of bad reactions with certain images, but the performance gain of DCT scaling makes it attractive despite the fact that it can reveal either or both of these types of artifacts. Once you've thrown that out and decided to decode all the pixels and convert to linear light, I figure you may as well go ahead and process the whole thing with a high-quality scaler. I don't see a huge performance boost coming from the 9+3 approach, but I haven't tried it, so that's just a guess. It's certainly nowhere near the gains achievable from adding 1 into the mix ;)

I'm a fan of the gamma test and rings images, but I haven't seen that third one before. I'll check it out.

lilith commented 8 years ago

> There are some images that will show moiré patterns that come and go with resize ratio when a low-quality scaler is involved as well. This image just happens to exhibit both behaviors. It's the image that got me started writing MagicScaler.

Even the highest-quality scalers are subject to moiré patterns, particularly Catrom. Anything that produces visually sharp results will, with some images, cause that kind of artifact. Periodic overlaps in frequency space are impossible to avoid.

Linear processing, on the other hand, can eliminate one category of flaws completely.

There's also the reality that our images are scaled a third time by the browser, possibly introducing additional moiré effects, taking at best an ~85% ideal solution and making it a ~70% solution.

The variance between a (correct scaler, but limited to non-overlapping blocks) and a (correct scaler over the entire image) is pretty small; essentially, 43% of pixels should be slightly blurrier, but post-processing can almost perfectly adjust for that, as it's deterministic. Thus, so far in Imageflow, my effort has been focused on optimizations with strictly bounded possibility for error, with the assumption that some people are willing to trade a 1% error ratio (strictly bounded) for some performance, but not optimizations that introduce unbounded error.

saucecontrol commented 8 years ago

That's a fair approach. It just comes down to what 'some performance' means. I'm really only willing to sacrifice quality if the performance gains are quite significant, and the gains from DCT scaling are certainly that. I may consider a 9+3 approach if I see enough performance benefit from it. For now we can test where our implementations overlap.

lilith commented 8 years ago

Off-topic: I want to thank you for applying ResampleScope to FastScaling. I hadn't come across ResampleScope before, and was instead relying on unit tests of my weighting functions/checksums of outputs. This blind spot has since been rectified - Imageflow now runs ResampleScope on all filters and generates ~4 thousand images for DSSIM comparison against ImageWorsener & ImageMagick to ensure results are pixel-perfect. As a result, I discovered two bugs in our weighting system: a multiply instead of an add (typo?), responsible for the sampled weights jumping to zero too early at some ratios, and a conflation of blur/zero-crossing values that was giving all filters a zero-crossing of one, making them all act abnormally similar. Performance appears to have improved by ~11% (possibly due to skipping multiply-by-zero operations - something we changed for x-plat test result consistency), and I will be backporting the fixes to FastScaling. I also added an integration test to export the weights generated for each filter at a variety of ratios and ensure they are consistent across all our CI platforms.
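That weight-export test could look something like the sketch below, where a hypothetical triangle kernel stands in for the real filters and the digest is what gets compared across CI platforms:

```python
import hashlib

def kernel(x: float) -> float:
    return max(0.0, 1.0 - abs(x))   # stand-in; swap in the real filter

def window_weights(ratio: float, support: float = 1.0) -> list[float]:
    # Compute the normalized contribution window for one output pixel,
    # widening the kernel when downscaling.
    scale = max(ratio, 1.0)
    r = support * scale
    taps = [kernel(i / scale) for i in range(-int(r), int(r) + 1)]
    total = sum(taps)
    return [t / total for t in taps]

# Dump a stable digest per ratio; diffing these across platforms catches
# any drift in the weighting math.
for ratio in (1.0, 1.5, 2.0, 3.0, 7.0):
    w = window_weights(ratio)
    digest = hashlib.sha256(",".join(f"{t:.12f}" for t in w).encode()).hexdigest()
    print(ratio, digest[:16])
```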

> For now we can test where our implementations overlap.

I think that's basically 9, as WIC Fant "A.K.A. Shark Fin" isn't really something I'm interested in comparing to, and I understand that it is a fallback path of your implementation of '1'. I can't agree that frequency domain loss and broken spatial scaling (let alone box scaling) are somehow equivalent. As far as the Shark Fin goes, my guess is that they're using a fixed-pixel-count kernel as if pixel count and window lobe count were somehow interchangeable.

There may be a few differences:

  1. Imageflow does not use premultiplication except when in floating-point (96- or 128-bpp precision) - this prevents destruction of information in areas with low alpha values. As watermarking is a common workflow, and watermark opacity is adjusted by users at runtime, this is not a contrived situation. Storing premultiplied data in 32-bit buffers is banned in both Imageflow and ImageResizer. Keep in mind that FastScaling was written to replace a specific DrawImage use case - ImageResizer's - and that "baseline" DrawImage usage reflects the best perf available given ImageResizer's quality and interoperability requirements.
  2. Imageflow does not currently offer a flag that disables color correction based on the included ICC profile.
  3. Imageflow does not yet optimize for opaque images, and sticks to 32-bit (and the associated alpha channel management costs). This will be changed by the first beta release, but is not a priority for current prototypes.

The work described by 9, as implemented by the Imageflow prototype, would be:

  1. Decode jpeg to YCbCr, then to 32-bit BGRA.
  2. Apply ICC profile.
  3. Load rows into 128-bit form, gamma-correct, and premultiply.
  4. Resample in both directions, in 128-bit form.
  5. Unmultiply, then convert from linear 128-bit to 32-bit sRGB (BGRA).
  6. Convert to YCbCr and encode jpeg.
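Steps 3-5 might look like the following sketch (illustrative only, not Imageflow's code; a toy 2x2 box reduction stands in for the real resampler, and even dimensions are assumed). Doing the premultiply in float is what avoids the 8-bit precision loss in low-alpha regions mentioned above.

```python
import numpy as np

def srgb_to_linear(c):
    return np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)

def linear_to_srgb(l):
    return np.where(l <= 0.0031308, 12.92 * l, 1.055 * l ** (1 / 2.4) - 0.055)

def resample_linear_premult(bgra_u8: np.ndarray) -> np.ndarray:
    f = bgra_u8.astype(np.float32) / 255.0            # step 3: widen to float
    rgb, a = srgb_to_linear(f[..., :3]), f[..., 3:4]
    premult = rgb * a                                 # premultiply in float
    # step 4 stand-in: 2x2 box reduce where the real code resamples
    ds = lambda x: (x[0::2, 0::2] + x[1::2, 0::2]
                    + x[0::2, 1::2] + x[1::2, 1::2]) / 4
    premult, a = ds(premult), ds(a)
    rgb = np.where(a > 0, premult / np.maximum(a, 1e-6), 0)  # step 5: unmultiply
    out = np.concatenate([linear_to_srgb(np.clip(rgb, 0, 1)), a], axis=-1)
    return (out * 255.0 + 0.5).astype(np.uint8)
```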

I also can't guarantee that I'll be able to enable Windows optimizations very soon.

Windows optimization is deprioritized/delayed compared to the Linux build for three reasons:

  1. All versions of IIS combined barely total 13% of the web server market. Given that IIS 8 is only 33.4% of that sliver, one can estimate that around 4.3% of web servers use a modern version of Windows/WIC.
  2. Optimizing on Linux is easy; GCC is (comparatively) a dream to work with.
  3. Optimizing on Windows is hard; MSVC takes a very different approach to vectorization, and tends to produce code 2-6x slower than the equivalent GCC result in my unscientific tests. I also don't have dedicated Windows hardware - just an Ubuntu (Skylake Xeon) workstation and a Mac laptop with some VMs. Thus I spin up Azure instances for official Windows benchmarks - and do so less frequently.

Optimizing for mainstream server operating systems first, and adding alternates for Windows later, has so far been the most time-effective approach for my work, and I have to remain focused on effective use of time given that I have a very near deadline.

That said, I'm happy to create a benchmarking option in the imageflow prototype for you to play with - but given that I would need to do the comparison on Windows, I will probably delay that until you publish MagicScaler to GitHub. In general I don't benchmark closed-source against open-source projects, as it can be excessively time-consuming if even the smallest changes are required to produce an apples-to-apples benchmark (and it's practically impossible to verify such changes if results are self-reported).