libvips / vips-bench

a simple image processing benchmark implemented in a range of image processing packages

Possible! Benchmark! Addition! Of! Yahoo! Ymagine! #4

Closed lovell closed 9 years ago

lovell commented 9 years ago

File this under VIPS "evangelism"...

As well as a closed-source GPU pipeline, Flickr also use their own BitmapFactory-inspired Ymagine open source library.

When alerted to VIPS, specifically the Speed and Memory Use page, Flickr's Head of Engineering @tedd4u said he "would be curious where [Ymagine] fits into that benchmark list ;)".

jcupitt commented 9 years ago

Oh, interesting. I've tried building ymagine but they have some kind of very funky homemade build system (I think?) with no instructions.

There's this:

https://github.com/yahoo/ygloo-ymagine/blob/master/jni/tools/build.sh

Not quite sure how to run it, I'll dig a bit more. Or is there a standard way to build Java C extensions?

jcupitt commented 9 years ago

Oh crumbs it's all android. I'll see if I can find all the bits I need to build it.

lovell commented 9 years ago

If https://github.com/yahoo/ygloo-ymagine/blob/master/tests/generic/ymagine/main_transcode.c is to be believed, it should be usable from C directly rather than via Java/JNI.

Perhaps @hassold may be able to help here. Failing that, my friend @millimeep (hi Iain) used to work in Yahoo's build team.

millimeep commented 9 years ago

Having had a quick look through the build files at https://github.com/yahoo/ygloo-ymagine it looks like anything Y! specific has been removed.

It should build with gradle or ant?

It possibly needs some ant files that are supplied by the Android SDK??

jcupitt commented 9 years ago

Hi @millimeep, thank you very much for checking. I've now got all of the Android SDK stuff in and "ant release" seems to work. It doesn't seem to build any of the native stuff in jni/ though, is there some simple way to make this happen?

Sorry to be so dumb, I've never touched any android or java build stuff before.

jcupitt commented 9 years ago

Ooop, looks like I'm missing part of the NDK, I'll keep digging.

hassold commented 9 years ago

Not sure about the context, or whether you wish to build for Android, iOS, or desktop/server. There are some instructions on building Ymagine for your native system (Linux, MacOSX, ...), Android or iOS at https://github.com/yahoo/ygloo/wiki/Getting-started

The build system supports using the Android NDK tools in the classic way when targeting the Android platform, but for other platforms all you need is repo to check out the source tree, plus make (and of course a toolchain). The NDK and other Android tools are not needed to build for a Linux or MacOSX target.

Short HOWTO guide is:

mkdir ymagine
cd ymagine
repo init -u git@github.com:yahoo/ygloo.git
repo sync -j4
make all native -j4

Output binaries can then be found in the out/target/ directory. For example, if building on MacOSX:

./out/target/darwin-x86_64/bin/ymagine  transcode -width 320 -height 320 -format jpg -force framework/ymagine//tests/android/imageview/assets/sample2.jpg out.jpg

I also just pushed an update of our latest developments into the public repo, be sure to fetch the up to date version.

jcupitt commented 9 years ago

Oh that's much better, I have an executable now. Thank you, I was drowning in 100s of MB of Android SDK/NDK. I hadn't seen that wiki page.

I'll see if I can make a C benchmark program.

hassold commented 9 years ago

Don't hesitate to let me know what you want to implement, I'll be happy to provide a template for the C implementation (in particular the right way to set transcode options in decoding pipeline callbacks). For a quick command line test, it seems that what you want may be close to something like:

ymagine  transcode -crop 600x400@100,100 -width 320 -height 320 -sharpen 0.1 -format jpg -force in.jpg out.jpg
hassold commented 9 years ago

Here is code to decode an input, crop 100 pixels off each side, resize to 90%, sharpen and save the result. Using a callback in the decoding pipeline isn't the most efficient way to achieve this, but it is probably the most natural one.

Also, I realize that your benchmarks perform this operation on a TIFF input. Ymagine focused on providing optimized support for the jpeg, webp and png formats, and TIFF support has been removed. But... well, if you wish to perform a similar benchmark with a JPG input, here it is...

#include "ymagine/ymagine.h"

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#ifndef O_BINARY
#define O_BINARY 0
#endif

/* Using a callback to set output options dynamically based on input image */

static int
progressCallback(YmagineFormatOptions *options,
                int format, int width, int height)
{
  /* scaleMode can be YMAGINE_SCALE_CROP or YMAGINE_SCALE_LETTERBOX */
  int scaleMode = YMAGINE_SCALE_LETTERBOX;
  int pad = 100;
  int cropwidth;
  int cropheight;
  int outwidth;
  int outheight;

  if (width <= 2 * pad || height <= 2 * pad) {
    return YMAGINE_OK;
  }

  cropwidth = width - 2 * pad;
  cropheight = height - 2 * pad;
  YmagineFormatOptions_setCrop(options, pad, pad, cropwidth, cropheight);

  outwidth = (cropwidth * 90) / 100;
  outheight = (cropheight * 90) / 100;
  if (outwidth < 1) {
    outwidth = 1;
  }
  if (outheight < 1) {
    outheight = 1;
  }
  YmagineFormatOptions_setResize(options, outwidth, outheight, scaleMode);

  return YMAGINE_OK;
}

int main(int argc, const char* argv[])
{
  int fdin;
  int fdout;
  const char* infile;
  const char* outfile;
  int rc = YMAGINE_ERROR;

  if (argc < 3) {
    fprintf(stdout, "usage: bench <infile> <outfile>\n");
    return 0;
  }

  infile = argv[1];
  outfile = argv[2];

  fdin = open(infile, O_RDONLY | O_BINARY);
  if (fdin < 0) {
    fprintf(stdout, "failed to open input file \"%s\"\n", infile);
  } else {
    int fmode = O_WRONLY | O_CREAT | O_BINARY;

    /* Truncate file if it already exists */
    fmode |= O_TRUNC;

    fdout = open(outfile, fmode, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
    if (fdout < 0) {
      fprintf(stdout, "failed to open output file \"%s\"\n", outfile);
    } else {
      Ychannel *channelin = YchannelInitFd(fdin, 0);
      Ychannel *channelout = YchannelInitFd(fdout, 1);
      YmagineFormatOptions *options;

      options = YmagineFormatOptions_Create();
      YmagineFormatOptions_setFormat(options, YMAGINE_IMAGEFORMAT_JPEG);
      YmagineFormatOptions_setSharpen(options, 0.1);
      YmagineFormatOptions_setCallback(options, progressCallback);
      rc = YmagineTranscode(channelin, channelout, options);
      YmagineFormatOptions_Release(options);

      YchannelRelease(channelout);
      YchannelRelease(channelin);

      close(fdout);
    }

    close(fdin);
  }

  return (rc == YMAGINE_OK) ? 0 : 1;
}
jcupitt commented 9 years ago

Wow, thanks very much! I'll add it to the test page.

I guess the 10% resize there is your high-quality one. Most of the other tests on that page are a simple bilinear affine, which seems unfair. Do you have a feeling for how much more expensive your resize is?

The other tests are mostly just doing a 3x3 conv. Will your 0.1 sharpen be roughly comparable?

Yes, I'll use jpg for the ymagine one and compare it to vips with a jpg image.

jcupitt commented 9 years ago

I tried with a 5k x 5k jpeg and got these numbers on my laptop (dell xps-13):

$ time /home/john/ymagine/out/target/linux-x86_64/bin/ymagine transcode -crop 4800x4800@100,100 -width 4320.0 -height 4320.0 -sharpen 0.1 -format jpg -force tmp/x.jpg tmp/x2.jpg

real    0m1.313s
user    0m1.300s
sys 0m0.004s
$ time ./vips-c tmp/x.jpg tmp/x2.jpg

real    0m0.594s
user    0m1.764s
sys 0m0.024s

That's with the vips autothread stuff enabled. If you turn off threading you see:

$ export VIPS_CONCURRENCY=1
$ time ./vips-c tmp/x.jpg tmp/x2.jpg

real    0m1.029s
user    0m1.256s
sys 0m0.032s

So pretty similar to ymagine. Even with threading off, vips still runs a background write-behind thread, hence user > real.

hassold commented 9 years ago

The sharpening filter also uses a 3x3 convolution, applied on the fly in the scaler (never accumulating more than 3 scanlines at a time), so this part should be a fair comparison. The value passed (say 0.1) is a sigma-like parameter that doesn't expand the matrix, just changes the amplitude, so it has no impact on performance (except of course when set to 0.0, which disables sharpening). Maximum sharpening is achieved with a value of approx. 0.7.

Regarding performance, the downsizing approach uses a mix of early DCT truncation followed by exact region averaging (with further optimization paths for some specific ratios, e.g. ones that are common at Flickr, but a 90% ratio doesn't trigger those optimizations).

Ymagine is pretty sensitive to target-architecture optimization (given the strong focus we've put on ARM/NEON), but I'll try to find some time to look at your test case in more detail. Thanks for the early numbers.

jcupitt commented 9 years ago

I tried your C version and it's a little quicker:

$ time ./ymagine-c tmp/x.jpg tmp/x2.jpg 

real    0m1.041s
user    0m1.036s
sys 0m0.000s

Exact region averaging sounds like a box filter, is that right? It should be very roughly comparable to a bilinear then.

Another difference would be that I usually build libvips with gcc and your build system seems to default to clang. I'll try doing a clang build of vips.

jcupitt commented 9 years ago

Here's the output of the vips benchmark test on my laptop (dell xps-13), including ymagine:

program, time (s), peak memory (MB)
ppm-vips-c, 0.39, 33.3984375
vips-c, 0.41, 35.25
vips-cc, 0.42, 42.125
vips.py, 0.44, 44.2421875
ruby-vips, 0.47, 44.89453125
vips8.py, 0.48, 46.58984375
jpg-vips-c, 0.58, 46.6484375
vips, 0.62, 36.41796875
vips8-cc, 0.64, 34.91015625
ruby-vips8, 0.72, 57.8984375
pnm, 0.95, 76.16796875
nip2, 0.96, 75.1484375
ymagine-c, 1.02, 3.52734375
ppm-gm, 1.04, 490.08203125
gm, 1.08, 491.078125
opencv, 1.23, 202.578125
convert, 1.34, 484.05078125
jpg-gm, 1.38, 489.953125

Memuse is impressively low (or have I made an awful error?), speedwise vips is very roughly twice as fast, but of course vips is using both cores.

hassold commented 9 years ago

Memory footprint, and optimizing the pipeline to never allocate a large image or buffer, is the key focus of Ymagine's development and architecture. It was initially a mobile-first framework solving the problem of high-efficiency decoding and scaling of high-res pictures for our mobile products. Scaling a 5kx5k image to 90% is definitely not what it is optimized for. If you were to run the benchmark resizing that 5kx5k picture into, say, a 600x600 preview, the memory efficiency would be even more significant (never allocating a buffer larger than 3 scanlines at the input resolution), and the speed would probably be much more competitive.

jcupitt commented 9 years ago

vips uses roughly 128x128 tiles in this test, so it needs buffers of 128 scanlines. Times two, since it has to double-buffer output for threading, then times four for four threads (two real, two HT), then times four for the four pipeline stages. It all adds up, ouch.

Yes, this test is stressing image read and image write. This flatters vips since it tries to run read and write in parallel. If the output were small they'd look a lot closer.

I realized the vips test is actually using a bicubic interpolator. A box filter would take 0.1s off the vips time.

jcupitt commented 9 years ago

I tried on the machine I run the "speed and memuse" benchmarks on, a 6-core xeon. ymagine stays at about 1.07s, vips with a jpg source is 0.38s, so vips is just under 3x faster. That would put ymagine in fourth place, I think.

Shall I add ymagine to the table, plus a note that it's single-threaded? I'm not certain how fair (or representative) this test is.

tedd4u commented 9 years ago

First, thanks @jcupitt for doing this work! Regarding including ymagine, why not? The workload we target with ymagine is pretty different from this test though. Imagine an infinite scrolling list that we want to display at 60 fps where each image is being resized down from 500, 640 or 800 pixels by 5-25%.

hassold commented 9 years ago

Feel free to report the performance of Ymagine for your use case, even if it's indeed not the typical scenario Ymagine has been architected and tuned for. Memory footprint was our strong/main focus: Ymagine aims at dealing with smaller output resolutions, both as a mobile client library and for the Flickr camera roll. Lastly, leveraging multiple cores wasn't enabled (though we have a branch with a multithreaded decoder) because in our typical deployment environments (mobile, or servers under massive concurrent load) the other cores are also under heavy load, so distributing the computation of each resize doesn't improve total throughput. But it's always interesting to have a data point about how a component behaves in a given scenario, so I find it totally fair and useful to have it reported for your use case.

lovell commented 9 years ago

Amazing work, thanks everyone. The low memory usage of Ymagine is really quite impressive.

John, given the benchmark page says "On a single-core machine the table would look quite different" it might be worth including your VIPS_CONCURRENCY=1 example in the list for comparison.

It would be interesting, but strictly left as an exercise for the reader, to try on ARM hardware given Ymagine has been optimised for it. liborc supports NEON intrinsics so vips' convolution operations hopefully shouldn't suffer too much here.

I use AWS instances for the sharp benchmarks as that tends to be the sort of environment people use it with. In the (highly unlikely) event I find myself with some free time over the next few months I may create an experimental Node.js binding for Ymagine to see how it compares.

jcupitt commented 9 years ago

OK, I've updated the "speed and memory use" page to include ymagine. I've added some notes about threading and the use of JPEG images in this case. I've reported ymagine's speed relative to vips-c with a JPEG source, though of course a lot of time is being spent in libjpeg in this case. Thank you again @hassold for supplying the benchmark code!

I suppose the argument in favour of threading ymagine would be that although on a loaded server it wouldn't improve overall throughput, it might reduce latency. I can imagine that being useful for on-demand server-side image resizing.

Did you consider a write-behind thread? You have two output buffers, each 8 scanlines high. When one fills, you swap them and a background thread sends those lines to libjpeg. It's simple to implement and you can hide almost all the write time. I suppose for small image output the write time is not very significant anyway.

@lovell I wondered about the VIPS_CONCURRENCY=1 timing, but didn't include it in the end since it sets the size of the worker threadpool, and it does not turn off the write-behind thread. It felt slightly misleading.