nothings / stb

stb single-file public domain libraries for C/C++
https://twitter.com/nothings
Other
26.95k stars 7.72k forks source link

Making JPEG decoder faster #944

Closed photopea closed 4 years ago

photopea commented 4 years ago

I would like to use your JPEG decoder in a webapp. I compiled it to WebAssembly (13.8 kB .WASM file).

It works well and is quite fast (3x faster than the Javascript decoder from the pdf.js library).

Do you think it could be made even faster? Sadly, WASM does not support SIMD instructions, so we should aim for other ways of optimization. What about reading Huffman codes faster (use different hash tables, to decode more bits at one step). Do you think it is possible?

rygorous commented 4 years ago

Making the tables larger isn't free. The returns are rapidly diminishing (longer codes are rare; after all, that's the whole point of Huffman coding!) so the wins end up small, while the cost of building a larger decode table (not a hash table by the way, just a straight array) adds overhead to every image decode. (Also, if you make them too large, they don't comfortably fit in the L1 cache anymore, and decode performance tanks as well.)

An easier way to increase speed is to interleave the passes in a more fine-grained fashion. The current code batch-decodes DCT blocks for the whole image and then does the color-space conversion to RGB in a second pass. This is inefficient on larger images because decoded pixels drop out of the lower cache levels before they get accessed again. It should be reasonably easy to modify the code to trigger color-space conversion after every couple rows (probably one MCU's height) instead.

Also, progressive JPEG decoding is slow right now, and we haven't really investigated any ways of making it faster.

There are many others ways of making the JPEG decoder faster, and also different ways of increasing the speed of the Huffman decoder, but all those add significant complexity which is at odds with the design goals of stb_image. Feel free to fork the code if you want to make a decoder that focuses on speed above all else (or look at existing libraries like libjpeg_turbo for some hints on what you could do), but the main stb_image code will always favor simplicity over maximum speed.

On Wed, Apr 1, 2020 at 3:48 AM Photopea notifications@github.com wrote:

I would like to use your JPEG decoder in a webapp. I compiled it to WebAssembly (13.8 kB .WASM file).

It works well and is quite fast (3x faster than the Javascript decoder from the pdf.js library).

Do you think it could be made even faster? Sadly, WASM does not support SIMD instructions, so we should aim for other ways of optimization. What about reading Huffman codes faster (use different hash tables, to decode more bits at one step). Do you think it is possible?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/nothings/stb/issues/944, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIHB6GHOGFEULO7VCM2TMLRKMLXPANCNFSM4LYYOCNA .

photopea commented 4 years ago

I should say, that I value the simplicity over the speed. But maybe, it is possible just to rewrite something, to get more speed. Or increase the size of the code only by 10 - 20%, while the speed gains will be much higher. I think libjpeg_turbo is fast because of cpu-specific code (assembly or SIMD instructions), which is not what I want.

I implemented a Deflate decoder once, which is much faster than the ZLIB. My decoder requires 64 kB of RAM, while ZLIB works in 32 kB of RAM. And my code is much shorter.

Could you spend an hour or two to implement your speed-up ideas? Maybe you can make it 2x faster with 20 extra lines of code. I can donate 100 - 200 USD to this project for that.

rygorous commented 4 years ago

It's not a question of money. There are libraries that target maximum performance, but stb_image very explicitly does not - see "Philosophy" section near the top of the file.

Any big changes we will make in the foreseeable future are going to be concerned with priorities #1 and #2, namely ease of use and ease of maintenance respectively. In particular, there are a few API mistakes we want to fix (addressing #1), and a bunch of functionality we want to remove because it's rarely used but causes a disproportionate amount of maintenance work (addressing #2). Perf work is a distant third on the priority list right now, and you might think that increasing the amount of code by "only 10-20%" is not a big deal in that regard but I respectfully disagree. We would also be unlikely to take a large patch that added said functionality because the bulk of the work associated with any change is ongoing maintenance and as soon as something is in a released version of stb_image, we're stuck with it.

There's a lot of stb libraries and Sean and me are maintaining them (mostly Sean). We're both busy and have other hobbies and for better or for worse, all stb library maintenance work has to fit into something like 5-6 weekends per year because that's how much time we're willing to spend on it. With a limited time budget, we have to pick our battles, and we decided a long time ago that we needed to stop the scope creep that had affected stb_image in particular up to that point, and instead focus on fixing the bugs we had (and have) because stb_image usage has grown far beyond its original intended domain (games and throw-away tools) which makes a big difference in the impact of things like security bugs.

If you want to work on a faster JPEG decoder based on the stb_image one, feel free. The license explicitly permits you to use the code whichever way you want. I'm happy to give you pointers on what areas you could improve if you wish to. However, right now anyway, adding extra complexity to increase performance is not something we would accept in a pull request for stb_image, and even if you had a 2x faster decoder we'd be very unlikely to take that patch if you got it in any way outside of fixing a stupid bug.

In short, even if you had a 2x faster decoder in a pull request right now and all I had to do was decide whether to merge a 500-line patch or not, the answer would almost certainly still be "no".

On Wed, Apr 1, 2020 at 11:50 AM Photopea notifications@github.com wrote:

I should say, that I value the simplicity over the speed. But maybe, it is possible to rewrite something, to get more speed. Or increase the size of the code only by 10 - 20%, while the speed gains will be much higher. I think libjpeg_turbo is fast because of cpu-specific code (assembly or SIMD instructions), which is not what I want.

I implemented a Deflate decoder once, which is much faster than the ZLIB. My decoder requires 64 kB of RAM, while ZLIB works in 32 kB of RAM. And my code is much shorter.

Could you spend an hour or two to implement your speed-up ideas? Maybe you can make it 2x faster with 20 extra lines of code. I can donate 100 - 200 USD to this project for that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/nothings/stb/issues/944#issuecomment-607428344, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIHB6DHQ6ESYGBRGWGA73TRKOEFNANCNFSM4LYYOCNA .

photopea commented 4 years ago

I was talking about 20 extra lines of code, you are talking about 500 lines, which I never said.

I understand your position. I think you are doing a great job and I wish you good luck, keep up the good work!