image-rs / jpeg-decoder

JPEG decoder written in Rust
Apache License 2.0

Wasm simd #264

Closed: dustletter closed this pull request 1 year ago

dustletter commented 1 year ago

This adds a SIMD implementation using WebAssembly's simd128 feature, along with a small change to fix link errors for the wasm32-wasi target. It speeds up the large_image benchmark by about 45%. All tests pass except the one for large_image.jpg, whose image dimensions overflow on 32-bit platforms. To run the benchmarks, Criterion needs to be updated to version 0.4, which makes Rayon an optional feature. I didn't check in that change because I didn't want to touch dependencies in a mostly unrelated PR.

Sorry for making an unsolicited pull request! I did this for a hobby project and thought it might be worth upstreaming.

HeroicKatora commented 1 year ago

Very nice. Definitely worth upstreaming! And 45%, that's just incredible :)

dustletter commented 1 year ago

> It's interesting that the load/store operations have no alignment requirements so we can avoid that stack buffer nonoverlapping copy we have in other implementations.

This was surprising to me too. In the Wasm spec it's possible to declare a higher alignment, but Rust decided against making that the default. (Side note: is it possible, or worth it, to align those slices for the other versions?)
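To make the alignment point concrete, here is a minimal portable sketch of the two loading strategies being compared. It is not code from this PR: `Aligned16`, `load_via_stack_buffer`, and `load_unaligned` are hypothetical names, and a `u128` stands in for a 128-bit vector register. The first function copies through an aligned stack buffer, which is what an implementation has to do if its vector load demands alignment. The second reads straight from the slice, which is what an alignment-free load such as Wasm's `v128_load` allows.

```rust
use std::ptr;

/// 16-byte aligned scratch space, standing in for the stack buffer
/// (hypothetical type, for illustration only).
#[repr(align(16))]
struct Aligned16([u8; 16]);

/// Strategy 1: bounce the bytes through an aligned stack buffer first.
fn load_via_stack_buffer(src: &[u8], offset: usize) -> u128 {
    assert!(offset <= src.len() && src.len() - offset >= 16);
    let mut buf = Aligned16([0; 16]);
    // SAFETY: the assert guarantees 16 readable bytes starting at `offset`,
    // and `buf` is a separate local, so the two regions cannot overlap.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr().add(offset), buf.0.as_mut_ptr(), 16);
    }
    u128::from_ne_bytes(buf.0)
}

/// Strategy 2: read directly from the slice with no alignment requirement.
fn load_unaligned(src: &[u8], offset: usize) -> u128 {
    assert!(offset <= src.len() && src.len() - offset >= 16);
    // SAFETY: same bounds check as above; `read_unaligned` accepts any address.
    unsafe { ptr::read_unaligned(src.as_ptr().add(offset).cast::<u128>()) }
}
```

Both functions return the same value for every in-bounds offset; the second simply skips the intermediate copy.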

This was almost all done by copying the SSSE3 implementation and find-and-replacing instructions (e.g. _mm_mulhrs_epi16 to i16x8_q15mulr_sat), with the exception of transpose() and the shuffle to RGB order in color_convert_line_ycbcr(). I admit I took the pointer arithmetic for granted 😅 I've gone through and added safety comments to all three unsafe blocks.
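For readers checking the `_mm_mulhrs_epi16` to `i16x8_q15mulr_sat` substitution, the following is a scalar sketch of what each instruction computes per 16-bit lane, based on the instruction definitions rather than on code in this PR. The function names `mulhrs_lane` and `q15mulr_sat_lane` are hypothetical. The two agree on every input except `i16::MIN * i16::MIN`, where the SSSE3 instruction wraps and the Wasm instruction saturates. Whether that input can occur in the decoder is not something this thread establishes.

```rust
/// Scalar model of one lane of SSSE3 `_mm_mulhrs_epi16`:
/// compute ((a * b) >> 14) + 1, then keep bits [16:1] of the result.
fn mulhrs_lane(a: i16, b: i16) -> i16 {
    let product = i32::from(a) * i32::from(b);
    // The `as i16` cast truncates, so the single overflowing case wraps around.
    (((product >> 14) + 1) >> 1) as i16
}

/// Scalar model of one lane of Wasm `i16x8_q15mulr_sat`:
/// round (a * b) / 2^15 and saturate to the i16 range.
fn q15mulr_sat_lane(a: i16, b: i16) -> i16 {
    let product = i32::from(a) * i32::from(b);
    let rounded = (product + 0x4000) >> 15;
    rounded.clamp(i32::from(i16::MIN), i32::from(i16::MAX)) as i16
}
```

For example, both functions map `(16384, 16384)`, which is 0.5 times 0.5 in Q15, to `8192`. For `(i16::MIN, i16::MIN)` the first returns `i16::MIN` and the second returns `i16::MAX`.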

mcroomp commented 1 year ago

What about just using the wide crate and letting the compiler do the rest? It does an amazing job of picking the right instructions depending on which features you enable (SSE2, AVX, WASM, etc.). Even the transpose is surprisingly fast despite looking like it would be super slow.

Here's an example implementation: https://github.com/microsoft/lepton_jpeg_rust/blob/main/src/structs/idct.rs

HeroicKatora commented 1 year ago

Sorry for forgetting to merge this after approvals.

@mcroomp Feel free to provide a PR with a performance comparison, but our experience with auto-vectorization-based approaches has, historically, not been amazing.