Consider adding support for mango (jpeg/png)

gunta commented 4 years ago

Why:

Mango JPEG decoder is faster than libjpeg-turbo. This means that we can have a 2-7x improved latency according to these benchmarks.

This would provide huge benefits for Image as a Service use cases where throughput matters but latency also does.

jcupitt commented 4 years ago

Hello @gunta,

That looks very nice, I'd not seen mango before. The benchmarks are impressive.

Let's revisit this when it's been packaged.

gunta commented 4 years ago

Looks its performance for PNG also is even better than libspng and stb.

jcupitt commented 4 years ago

Perhaps, though I'm puzzled by those numbers, libspng should be much quicker than libpng. I wonder if there's a build problem.

ghost commented 4 years ago

There was a build problem; the default build is for DEBUG so the BREW version was debug.. :D

ghost commented 4 years ago

Re: png, did some optimizations this week so it's ~30% faster. Color conversion still needs some tweaking, it's straight C++ code (no SIMD intrinsics) because it's performance effect is relatively small. The changes were: 1. fixed some filtering code to generate better code (especially with gcc). 2. use libdeflate, which is really fast (!) and 3. organized code so that it does more work with same buffers immediately - less passes over the data. A bit more work ahead and hopefully the working data remains in L1 until it is flushed out (that's the goal anyway). Cheers!

gemini133 commented 3 years ago

I added vips test based on the mango's jpeg_benchmark.cpp .

here is the result with 19M jpg on i7-9700 with HT on

./test-mango jpg.jpg 
----------------------------------------------
                load         save             
----------------------------------------------
mango:       240.6 ms     195.0 ms 
vips:         1.1 ms     749.3 ms

time vips copy jpg.jpg copy.jpg[strip=true,Q=100]

real    0m0.729s
user    0m0.964s
sys 0m0.037s

vips was bound with libjpeg-turbo and I enable avx512 for mango.

vips-8.10.4-Tue Dec 15 07:56:59 UTC 2020

ldd $(which vips) | grep jpeg
libjpeg.so.62 => /lib/x86_64-linux-gnu/libjpeg.so.62 (0x00007f4906d6a000)
ls -la /lib/x86_64-linux-gnu/libjpeg.so.62
/lib/x86_64-linux-gnu/libjpeg.so.62 -> /opt/libjpeg-turbo/lib64/libjpeg.so.62

the generated files: 32M t-mango.jpg 49M t-vips.jpg

Here is the code and please check if it's written properly. @jcupitt

``` /* MANGO Multimedia Development Platform Copyright (C) 2012-2020 Twilight Finland 3D Oy Ltd. All rights reserved. */ #include #include using namespace vips; using namespace mango; using namespace mango::filesystem; // ---------------------------------------------------------------------- // print // ---------------------------------------------------------------------- void print(const char* name, u64 time0, u64 time1, u64 time2) { printf("%s", name); printf("%7d.%d ms ", int((time1 - time0) / 1000), int((time1 - time0) % 10)); printf("%7d.%d ms ", int((time2 - time1) / 1000), int((time2 - time1) % 10)); printf("\n"); } // ---------------------------------------------------------------------- // main() // ---------------------------------------------------------------------- int main(int argc, const char* argv[]) { if (argc < 2) { printf("Too few arguments. usage: \n"); exit(1); } if (VIPS_INIT (argv[0])) vips_error_exit (NULL); printf("----------------------------------------------\n"); printf(" load save \n"); printf("----------------------------------------------\n"); u64 time0; u64 time1; u64 time2; // ------------------------------------------------------------------ time0 = Time::us(); Bitmap bitmap(argv[1]); time1 = Time::us(); ImageEncodeOptions options; options.quality = 1.00f; bitmap.save("t-mango.jpg", options); time2 = Time::us(); print("mango: ", time0, time1, time2); // ------------------------------------------------------------------ time0 = Time::us(); VImage in = VImage::new_from_file (argv[1], VImage::option()->set ("access", VIPS_ACCESS_SEQUENTIAL)); time1 = Time::us(); in.write_to_file ("t-vips.jpg[strip=true,Q=100]"); time2 = Time::us(); print("vips: ", time0, time1, time2); vips_shutdown (); } ```

gemini133 commented 3 years ago

here is the meson.build just for your convenience

project(
    'test mango', 'cpp',
    default_options : [
        'c_std=c11',
        'cpp_std=c++14',
        'default_library=static'
    ]
)
cpp = meson.get_compiler('cpp')
mango_deps = cpp.find_library('mango', required: true)
pthread = cpp.find_library('pthread', required: true)
vips = dependency('vips-cpp')
executable('test-mango', 'test.cpp', dependencies : [mango_deps, pthread, vips])

CC=gcc CXX=g++ meson build; ninja -C build; cd build; ./test-mango jpg.jpg

jcupitt commented 3 years ago

Yes, looks reasonable, though libvips disables chrominance subsampling for Q>=90, so you'd want to do the same for mango for this benchmark.

ghost commented 3 years ago

The encoder is always 4:4:4, not that much options for encoding. More paths can be added but I never needed anything more, and I am mostly writing this code for myself so it's not that easy to use. Gemini suggested versioning the library and I agree that should be done but this is "just" a hobby project and my main programming time goes into other things.

I looked at the vips jpeg decode timings and verified the results on AMD TR, these are with libjpeg-turbo, or some hw accelerated path? I tested libjpeg (-t) through GTK and get roughly same timings I get with calls to libjpeg directly. I noticed some GTK code there so figured maybe they were doing something (only tested 2.0, not 3.0). 240ms and 1.1 ms above, 200x faster, what's the trick? =D

I noticed a nice trick by ARM engineers to detect zeros in the DCT block to speed up the encoder, could do the same for x86.. but that's on encoder side. The approach was two parter: part one, de-zigzag into SIMD registers and then basically use compare to generate masks. Intel has lane-msb-to-mask instruction which should be useful for this.. but I am steering off-topic so no more about that..

TL;DR - 4:4:4 is always used by the encoder, it's very basic. It's not a big feat to add modes, just never needed any. libjpeg-turbo is really good library, I'd just keep using it. I wrote my code inspired by OpenCL feasibility study I did at ATi in 200? when compute support was being added to a GPU silicon we were designing. I used JPEG as one of the test cases so that's why I got into it. This was so one lifetime ago.. but wrote a CPU decoder from scratch for fun as an afterthought and here we are.

Now that you know what kind of crap the code is, you probably are better informed to stay out of it. :D

jcupitt commented 3 years ago

Hi @t0rakka, thanks for the information! Mango looks really nice, and the benchmarks are impressive.

Maybe try Q=89 for the benchmark then? It should give more comparable results. The large difference in output file size should vanish, at least.

libvips decode time is quick, but all it does is setup. Pixel decode happens at the same time as encode -- running the two together lets libvips overlap encode and decode for a 2x speedup.

Does the mango API support read and write of pixel buffers (as opposed to whole images)? libvips would need that to allow the overlapping. libvips usually reads and writes 8 or 16 scanlines at a time.

Has mango been fuzzed? libvips is part of oss-fuzz so an experimental read and write with mango might be a simple way to get mango into the fuzzing programme.

ghost commented 3 years ago

Yes, I have used zzuf for this purpose. A lot of checks in the chunk processing come from fuzzing and reading the specification with a magnifying glass. :D

The weakest part is the huffman decoder which is a bit speed-sensitive so using checks sparingly. It checks that it won't read out of bounds and writes are (mostly) guarded.

That said, there is still some work left in this area since when I run 1,000,000 iterations I get 20 or so failures. Not with all inputs but I have found some that still manage to fail.. but it is much more rare than before fuzzing that's for sure.

My main concern is non-sanitized random inputs, since I wrote a "coder's tool" that I use for checking compressed textures and other image formats. It's kind of neat that I can load one image as background, use other as a layer on top and move it around with mouse, fidding with blending modes and such. If I want to see alpha channel, I can and so on.. the key feature for me is that I can write plugins with GLSL for different parts of the rendering pipe so I can upgrade the tool for any new task that comes along. It's internal tool and so badly documented and crappy coding that I don't plan on releasing sources.. (before anyone asks).. but when some broken image is in a folder I am scanning, it sucks if the tool crashes so I sat down and enforced the code against broken inputs. Has worked fairly alright since then.. if you find any images that won't work, even if they SHOULDN'T, I can take a peek and try to guard. It's not a defence to just say; "this is not a valid jpeg file", since broken and corrupted files exist and we should at least be able to detect it in runtime (when possible).

So I did fuzz just for practical self-interest reasons; I don't like crashes even if they aren't "my fault" (since at end of the day, they are).

Not all formats are fuzzed, it would be a dream situation if fuzzing everything was part of regression tests and had proper CI, but the problem is time and how to best use it.. these are on the TODO list, but the list seems to be growing, so.. :P

ghost commented 3 years ago

About the return values.. I used to throw exceptions when something went wrong, but it wasn't very practical for UI use scenarios like the tool I described above. I rather get a diagnostic about the decoder state so that I can give feedback to the user (=myself) from the UI. Exception can only pack so much information.. it's also cool I get some telemetry from the instrumentation that I can present in the UI, it has been quite useful (for me).

Something I planned on doing a while now is changing the debugPrint() to be runtime configurable so that don't need to compile the code in DEBUG mode when I just want to know what's going on in the decoder. If I do that, then I'll upgrade other encoder/decoders as well.

I only do encode/decode of whole image unfortunately. Completely random access would be a bit tough universally across all formats, formats like PNG and JPEG are fairly sequential and have to be processed in specific order. It would be possible to split the decoding into segments, which would still be processed in specific order but that scheme breaks apart slightly with interlaced and progressive modes. Of course, we could in these cases just say that the segment height is image height.

Internally I have tried to implement the decoders to work so that data lifetime is as temporal as possible to keep things in cache and then never touch again. This of course was thrown back a little bit by the multi-threading (where it makes sense) since each thread wants to keep the data around until the worker is finished. It's kind of tuning and trade-off going on.. but let's say we have a progressive image, then we naturally have to keep all of the DCT blocks until all of the data is available, but, we only keep the full data in memory when it is needed (progressive / multiscan).

Let's say we have restart interval; in such images the segment height should be interval size but I can't remember that the specification guarantees the interval to be a multiple of MCU scan.. so it would be a bit more complicated than is necessary.

On the other hand, JPEG is fairly ageing and complicated format. The simple DCT-quantize compression pipeline is still very useful, but if I were to design a format in 2020 I'd probably have tiling built-in and not make the format as unnecessarily flexible that the JPEG is. It's really too flexible in some areas, making decoding a bit messy.. :D I done testing with DCT+RNS and the results aren't that great, the RLE-Huffman hybrid in the JPEG is really efficient and clever design. The fact that runs of zeros has special handling makes it hard to compete with even with modern bit-compressors, so that's pretty cool.

HEIF, H.265 and so on are cool but patent minefield..

If it was unclear the short version is that segmented decoding is not supported (I get what you mean but unfortunately it would be a bit much work to add support). I have to think about it as it's not unreasonable request, multiple passes over large amount of data is less efficient than small working set and then moving on to new data but as we don't do that right now the answer is that it's not supported..

ghost commented 3 years ago

To clarify why it never was a problem for me is that I often decode into mapped memory, say, I create GL texture, allocate storage and then use PBO to map the GL's staging buffer and decode directly into it. With GLES and mobile devices the staging buffer often equals texture's backing storage if its UMA and texture storage is linear (and the GPU has MMU, which is a given for past 10+ years) so the decoding in good day goes directly into the backing memory.

I used to work with GPUs w/o MMU and it's a real pain since all storage has to be contiguous memory, so the device had to put aside dedicated "graphics memory" at boot time. Ugh.. stone age with cavemen with sticks..

So the way I mostly use the code has been fine & happy the way it works.. and when profiling, it doesn't even matter, but in theory it's "better", even if numbers say it doesn't matter.

What I really need is to get my hands on DirectStorage and NG consoles with their "decompression engine-on-GPU", they want a bit different approach, some along the lines of: start a decoding job and get notified when it's complete -kind of deal. I don't have access yet so don't know how these things look like in practise, but I'll find out.. my prediction is the start a job get notified, in one form or other.. Microsoft has the WaitForMultiple -APIs they used for decades, somehow I won't be surprised if the DirectStorage has something similar in it.. but it could be something new as well, dying to know more. :)

jcupitt commented 3 years ago

Yes, libvips has two modes for PNG and JPG --- most of them are read and written in sequential chunks of 8 or 16 scanlines, but for interlaced images it'll stop and read and write the whole thing in one go. It doesn't need random access.

libvips uses glib as a base library, and that has a nice logging framework.

AVIF is nice -- as good as HEIC, but no nasty patent problems. As you say, tiling is a great way to parallelise encode and decode, and gets you nice locality too.

Anyway, I think mango would need a scanline-based read/write interface for libvips. Let's park this issue for now, and please do reopen if/when that happens.

jcupitt commented 3 years ago

Heh yes the new console async IO system sounds really interesting, and decompress to GPU memory is very cool. It could obviously replace the huffman layer of something along the lines of JPEG.

martin19 commented 3 years ago

Hi @jcupitt , what happened to the mango library? Is it still available anywhere?

jcupitt commented 3 years ago

Looks like it's been taken off github.

You'd need to ask the maintainers if it has a new address.

kleisauke commented 3 years ago

fwiw, according to this reddit post it has been taken off GitHub due to copyright issues.

libvips / libvips

Consider adding support for mango (jpeg/png) #1537