Closed lovasoa closed 2 years ago
@fintelia , @HeroicKatora : This PR is not finalized, but I'd love to get your feedback early.
@lilith : This may be of some interest to you.
I like this direction! We need to figure out a way to handle cases where packed_simd is unavailable, but perhaps just making it an optional feature and duplicating the dequantize_and_idct_block* functions would be enough?
Yes, we can do that. But ideally, I would have wanted even non-simd targets to benefit from this change, by using simulated simd, which is still faster than loops. Unfortunately, the API of ssimd seems to be out of date, and the crate seems to be unmaintained...
I have started to work on adapting ssimd to the latest packed-simd interface, so that we can use it with stable compilers here. The amount of work to do is moderate, since we use only a few simd types and operators.
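The optional-feature idea suggested above could be expressed with cfg attributes. A minimal sketch, assuming a hypothetical packed_simd Cargo feature; the function name is modelled loosely on dequantize_and_idct_block, but only the dequantize step is shown, and this is illustrative rather than jpeg-decoder's actual code:

```rust
// Hedged sketch of the proposed feature gate, not jpeg-decoder's real code.
// Only the dequantization step is shown; the signature is illustrative.

#[cfg(feature = "packed_simd")]
fn dequantize(coefficients: &[i16; 64], quantization_table: &[u16; 64]) -> [i32; 64] {
    // The nightly-only SIMD path using packed_simd types would live here.
    unimplemented!()
}

#[cfg(not(feature = "packed_simd"))]
fn dequantize(coefficients: &[i16; 64], quantization_table: &[u16; 64]) -> [i32; 64] {
    // Scalar fallback: element-wise multiply, compiles on stable Rust.
    let mut out = [0i32; 64];
    for i in 0..64 {
        out[i] = i32::from(coefficients[i]) * i32::from(quantization_table[i]);
    }
    out
}

fn main() {
    let mut coefficients = [0i16; 64];
    coefficients[0] = 4;
    let quantization_table = [2u16; 64];
    // Dequantized DC coefficient: 4 * 2 = 8.
    assert_eq!(dequantize(&coefficients, &quantization_table)[0], 8);
}
```

With this layout the crate still builds on stable by default, and nightly users opt in with `--features packed_simd`.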
Final benchmark

Baseline (before this change):
$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'
decode a 512x512 JPEG    time: [2.6583 ms 2.6783 ms 2.6984 ms]

packed_simd feature disabled:
$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo bench 'decode a 512x512 JPEG'
decode a 512x512 JPEG    time: [3.5087 ms 3.5182 ms 3.5281 ms]
~30% worse

packed_simd feature enabled:
$ RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo +nightly bench --features="packed_simd" 'decode a 512x512 JPEG'
decode a 512x512 JPEG    time: [2.0040 ms 2.0151 ms 2.0262 ms]
~20% better
So contrary to what I expected, the benchmarks didn't merely fail to improve; they actually got worse in the non-simd case.
Is there something you still wanted to do here? It's still marked as a draft, and the comment above alludes to a partial regression.
Hi @HeroicKatora! Yes, there is a performance regression in the non-simd version, so this PR should probably not be merged as-is. I contacted @lilith about this PR, but she told me her company no longer needs this and thus isn't prepared to fund it anymore, so I lost some of the motivation needed to work on it.
I wanted to switch from packed_simd to simdeez, but then I found an inconsistency in simdeez, for which I proposed a PR. It took some time, but it is now merged, so there is no more blocker. Are you interested in working on this, @HeroicKatora ?
I see, and I can totally feel how that would sap some motivation. Yeah, a total of 20% performance is significant. Switching to a stable crate with a 1.0 release would be positive as well, and if it helps avoid the regression, even better! I'm looking for a project to focus on next after apng is done, so maybe. The alternative is that I focus on image-canvas instead, while still shepherding this if you think it has the potential.
Hi all! I was wondering if there'd be some interest in picking this up, and, if not, if anybody here would mind me trying my hand at some handcrafted SIMD optimizations :) (probably with intrinsics for x86/sse and arm/neon, as I think portable SIMD is not stabilized yet)
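The handcrafted-intrinsics route usually pairs a compile-time architecture gate with a runtime CPU-feature check. A minimal sketch on stable Rust, using a hypothetical helper (not part of jpeg-decoder) with SSE2's saturating byte add standing in for the real IDCT kernels:

```rust
// Hedged sketch of runtime dispatch to hand-written intrinsics.
// Function names are hypothetical; only the dispatch pattern is the point.

fn add_saturating_u8(a: &[u8; 16], b: &[u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse2") {
            // Safety: guarded by the runtime feature check above.
            return unsafe { add_saturating_u8_sse2(a, b) };
        }
    }
    add_saturating_u8_scalar(a, b)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse2")]
unsafe fn add_saturating_u8_sse2(a: &[u8; 16], b: &[u8; 16]) -> [u8; 16] {
    use std::arch::x86_64::*;
    let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
    let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
    // Saturating unsigned byte add across all 16 lanes in one instruction.
    let vr = _mm_adds_epu8(va, vb);
    let mut out = [0u8; 16];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, vr);
    out
}

fn add_saturating_u8_scalar(a: &[u8; 16], b: &[u8; 16]) -> [u8; 16] {
    // Portable fallback for non-x86 targets or very old CPUs.
    let mut out = [0u8; 16];
    for i in 0..16 {
        out[i] = a[i].saturating_add(b[i]);
    }
    out
}

fn main() {
    let a = [200u8; 16];
    let b = [100u8; 16];
    // 200 + 100 saturates at 255 in every lane.
    assert_eq!(add_saturating_u8(&a, &b), [255u8; 16]);
}
```

An arm/neon variant would follow the same shape with std::arch::aarch64 intrinsics behind its own cfg gate.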
Yes, please pick this up!
This commit is a first step towards SIMD-accelerated IDCT. It optimizes only the first part of the IDCT, and no fallback has been implemented for non-nightly compilers.
Benchmark results:
Closes: https://github.com/image-rs/jpeg-decoder/issues/79 Please merge https://github.com/image-rs/jpeg-decoder/pull/144 first