[WIP] SIMD-accelerated IDCT

lovasoa commented 4 years ago

This commit is a first step towards SIMD-accelerated IDCT. It optimizes only the first part of the IDCT, and no fallback has been implemented for non-nightly compilers.

Benchmark results:

Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.3s or reduce sample count to 40
decode a 512x512 JPEG   time:   [2.2417 ms 2.2633 ms 2.2846 ms]                                   
                        change: [-26.302% -23.320% -20.299%] (p = 0.00 < 0.05)
                        Performance has improved.

Closes: https://github.com/image-rs/jpeg-decoder/issues/79

Please merge https://github.com/image-rs/jpeg-decoder/pull/144 first.

lovasoa commented 4 years ago

@fintelia , @HeroicKatora : This PR is not finalized, but I'd love to get your feedback early.

@lilith : This may be of some interest to you.

lovasoa commented 4 years ago

> I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature + duplicating the dequantize_and_idctblock* functions would be enough?

Yes, we can do that. But ideally, I would have wanted even non-simd targets to benefit from this change, by using simulated simd, which is still faster than loops. Unfortunately, the API of ssimd seems to be out of date, and the crate seems to be unmaintained...
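A minimal sketch of the "optional feature + duplicated function" idea, for illustration only. The function name `dequantize_block` is hypothetical and is not the crate's API, and it covers only the dequantization multiply, not the IDCT itself. It assumes a Cargo feature named `packed_simd` that pulls in the `packed_simd` crate, and that `i32x8::from_slice_unaligned` and `write_to_slice_unaligned` behave as in that crate's documentation.

```rust
/// SIMD variant, compiled only when the optional `packed_simd` feature is on.
/// Hypothetical helper: multiplies each coefficient by its quantization entry.
#[cfg(feature = "packed_simd")]
pub fn dequantize_block(coefficients: &[i16; 64], quant: &[u16; 64]) -> [i32; 64] {
    use packed_simd::i32x8; // nightly-only dependency behind the feature flag

    let mut out = [0i32; 64];
    for row in 0..8 {
        // Widen one row of eight values to i32 before loading it into a vector.
        let mut c = [0i32; 8];
        let mut q = [0i32; 8];
        for i in 0..8 {
            c[i] = i32::from(coefficients[row * 8 + i]);
            q[i] = i32::from(quant[row * 8 + i]);
        }
        let product = i32x8::from_slice_unaligned(&c) * i32x8::from_slice_unaligned(&q);
        product.write_to_slice_unaligned(&mut out[row * 8..row * 8 + 8]);
    }
    out
}

/// Scalar fallback with the same signature, used on stable compilers.
#[cfg(not(feature = "packed_simd"))]
pub fn dequantize_block(coefficients: &[i16; 64], quant: &[u16; 64]) -> [i32; 64] {
    let mut out = [0i32; 64];
    for ((o, &c), &q) in out.iter_mut().zip(coefficients.iter()).zip(quant.iter()) {
        *o = i32::from(c) * i32::from(q);
    }
    out
}
```

Callers see a single `dequantize_block` either way, so the rest of the decoder would not need to know which variant was compiled in.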

lovasoa commented 4 years ago

I have started to work on adapting ssimd to the latest packed-simd interface, so that we can use it with stable compilers here. The amount of work to do is moderate, since we use only a few simd types and operators.

https://github.com/lovasoa/ssimd
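To give an idea of what "simulated SIMD" means here, a rough sketch of an array-backed 8-lane vector type with a few lane-wise operators. The type `I32x8`, its methods, and the `butterfly` helper are written for this example and are not the actual ssimd or packed-simd API; wrapping arithmetic is an assumption made to keep the sketch panic-free.

```rust
use std::ops::{Add, Mul, Sub};

/// Hypothetical stand-in for an 8-lane `i32` vector, backed by a plain array.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct I32x8([i32; 8]);

impl I32x8 {
    /// Fill all eight lanes with the same value.
    pub fn splat(v: i32) -> Self {
        I32x8([v; 8])
    }

    /// Load the first eight elements of `s` (panics if `s` is shorter).
    pub fn from_slice(s: &[i32]) -> Self {
        let mut lanes = [0i32; 8];
        lanes.copy_from_slice(&s[..8]);
        I32x8(lanes)
    }

    pub fn to_array(self) -> [i32; 8] {
        self.0
    }
}

/// Implement a binary operator lane by lane with wrapping arithmetic.
macro_rules! lanewise_op {
    ($trait:ident, $method:ident, $wrapping:ident) => {
        impl $trait for I32x8 {
            type Output = I32x8;
            fn $method(self, rhs: I32x8) -> I32x8 {
                let mut out = [0i32; 8];
                for i in 0..8 {
                    out[i] = self.0[i].$wrapping(rhs.0[i]);
                }
                I32x8(out)
            }
        }
    };
}

lanewise_op!(Add, add, wrapping_add);
lanewise_op!(Sub, sub, wrapping_sub);
lanewise_op!(Mul, mul, wrapping_mul);

/// Add/subtract "butterfly" on eight lanes at once, the kind of step an IDCT
/// pass is built from. Illustrative only.
pub fn butterfly(a: I32x8, b: I32x8) -> (I32x8, I32x8) {
    (a + b, a - b)
}
```

Because every operation is a fixed-length loop over eight lanes, the compiler has a chance to auto-vectorize it on targets that support it, while the code still builds on stable.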

lovasoa commented 4 years ago

Final benchmark

Before this PR

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [2.6583 ms 2.6783 ms 2.6984 ms]

After this PR

all optimizations, `packed_simd` feature disabled

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [3.5087 ms 3.5182 ms 3.5281 ms]

~ 30% worse

all optimizations, `packed_simd` feature enabled

$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo +nightly bench --features="packed_simd" 'decode a 512x512 JPEG'

decode a 512x512 JPEG   time:   [2.0040 ms 2.0151 ms 2.0262 ms]

~ 20% better

So contrary to what I expected, the benchmarks did not improve across the board: they got worse for the non-SIMD case.
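For reference, the percentages can be recomputed from the median times reported above. A small sketch of the arithmetic (the helper name is made up for this example):

```rust
/// Relative change of `after_ms` versus `before_ms`, in percent.
/// Positive means slower (more time per decode), negative means faster.
pub fn relative_change_percent(before_ms: f64, after_ms: f64) -> f64 {
    (after_ms - before_ms) / before_ms * 100.0
}

/// Median times in milliseconds, copied from the benchmark output above.
pub const BEFORE_MS: f64 = 2.6783;
pub const AFTER_NO_SIMD_MS: f64 = 3.5182;
pub const AFTER_SIMD_MS: f64 = 2.0151;
```

With these inputs, `relative_change_percent(BEFORE_MS, AFTER_NO_SIMD_MS)` comes out at roughly +31.4 and `relative_change_percent(BEFORE_MS, AFTER_SIMD_MS)` at roughly -24.8, which matches the rounded figures above.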

HeroicKatora commented 4 years ago

Is there something you still wanted to do here? It's still marked as a draft and the comment alludes to a partial regression.

lovasoa commented 4 years ago

Hi @HeroicKatora ! Yes, there is a performance regression for the non-simd version, so this PR should probably not be merged as-is. I contacted @lilith about this PR, but she told me her company no longer needs this and so is no longer willing to fund it, so I lost a little bit of the motivation needed to work on this.

I wanted to switch from packed_simd to simdeez, but then I found an inconsistency in simdeez, for which I proposed a PR. It took some time, but it is now merged, so there is no more blocker. Are you interested in working on this, @HeroicKatora ?

HeroicKatora commented 4 years ago

I see, and can totally feel how that would turn off some motivation. Yeah, a total performance gain of 20% is significant. Switching to a stable crate with a 1.0 release would be positive as well, and if it helps avoid the regression, even better!

I'm looking for what project to focus on next after apng is done, so maybe. The alternative is that I focus on [image-canvas] instead, still shepherding this if you think it has the potential.

veluca93 commented 2 years ago

Hi all! I was wondering if there'd be some interest in picking this up, and, if not, if anybody here would mind me trying my hand at some handcrafted SIMD optimizations :) (probably with intrinsics for x86/sse and arm/neon, as I think portable SIMD is not stabilized yet)
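Roughly what I have in mind, as a sketch rather than a proposal for the final code: a safe entry point that checks for the CPU feature at runtime, an intrinsics path, and a scalar fallback. The function names are hypothetical, and the example only multiplies one row of eight 16-bit coefficients by quantization values (keeping the low 16 bits); it is not a full IDCT.

```rust
/// Safe entry point: uses SSE2 when available, scalar code otherwise.
pub fn dequantize_row(coefficients: &[i16; 8], quant: &[i16; 8]) -> [i16; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // SAFETY: SSE2 support was checked at runtime just above.
            return unsafe { dequantize_row_sse2(coefficients, quant) };
        }
    }
    dequantize_row_scalar(coefficients, quant)
}

/// Intrinsics path: one 128-bit multiply handles all eight lanes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn dequantize_row_sse2(coefficients: &[i16; 8], quant: &[i16; 8]) -> [i16; 8] {
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_mullo_epi16, _mm_storeu_si128};

    let mut out = [0i16; 8];
    // SAFETY: each array is exactly 16 bytes, and the unaligned load/store
    // intrinsics place no alignment requirement on the pointers.
    unsafe {
        let c = _mm_loadu_si128(coefficients.as_ptr() as *const __m128i);
        let q = _mm_loadu_si128(quant.as_ptr() as *const __m128i);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_mullo_epi16(c, q));
    }
    out
}

/// Scalar fallback with the same wrapping semantics as `_mm_mullo_epi16`.
fn dequantize_row_scalar(coefficients: &[i16; 8], quant: &[i16; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        out[i] = coefficients[i].wrapping_mul(quant[i]);
    }
    out
}
```

An arm/neon path would follow the same shape behind `#[cfg(target_arch = "aarch64")]`, and everything here builds on stable since it only relies on `std::arch`.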

lovasoa commented 2 years ago

Yes, please pick this up !
