SSSE3 implementations of 8x8 IDCT and YCbCr conversion

veluca93 commented 2 years ago

Hi folks!

This PR is more of a proof of concept / demo than an actual PR - I realize there's probably a few things to discuss/agree on/do before merging this in, but I figured a benchmark was the best way to get the discussion started ;). See also #146 for a previous take on this.

This PR contains two SSSE3 implementations of respectively 8x8 IDCT and YCbCr conversion. Why SSSE3? Two reasons: it is extremely common nowadays (99.25% of x86 PCs according to the Steam Hardware survey), while offering some operations (in particular blends) that provide some significant speedups over SSE3 or SSE2. On the other hand, the next "useful" upgrade in x86 SIMD would be AVX2, which wouldn't help that much (in particular it wouldn't help IDCT) and is not that common (85.98%).

From the benchmark, you can see speedups (with a single thread) of a factor of approximately 2x in the common case of a 444 YCbCr image.

While I didn't (yet!) implement this for ARM NEON, I expect the speedup there to be similar, likely resolving #202. See also #79.

decode a 512x512 JPEG   time:   [1.7992 ms 1.8000 ms 1.8009 ms]                                   
                        change: [-49.653% -49.613% -49.574%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 512x512 progressive JPEG                                                                             
                        time:   [4.5663 ms 4.5848 ms 4.6057 ms]
                        change: [-29.470% -28.997% -28.514%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 512x512 grayscale JPEG                                                                            
                        time:   [780.95 us 782.16 us 783.31 us]
                        change: [-36.512% -36.366% -36.222%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 2268x1512 JPEG time:   [22.317 ms 22.339 ms 22.362 ms]                                    
                        change: [-52.237% -52.149% -52.073%] (p = 0.00 < 0.05)
                        Performance has improved.

This brings me to the open questions I have related to this PR:

Unfortunately, different rounding produces slightly different results in the SIMD code, although I am fairly confident this is still within the bounds of conformance as defined in the JPEG standard ("DCTs back to the same coefficients"). I'm not sure if we'd want to make sure that the non-SIMD implementation matches the SIMD one.
The SIMD implementation doesn't handle all the values that the non-SIMD one does -- in particular it doesn't handle i16::MAX, although to my understanding such values can never appear in a JPEG bitstream.
As this code uses hand-crafted intrinsics, it makes use of two unsafe functions. I'm confident they are safe (they just do arithmetic, swizzle, and loads/store to arrays that are bounds-checked in advance), but it may be a good idea to put the SIMD code in a separate crate, in separate files and/or behind a feature flag.
Should we also SIMDfy other things? What comes to mind are other color conversions, smaller (4x4 and 2x2) DCTs and chroma upsampling.

I'd be happy to hear your thoughts on this :)

veluca93 commented 2 years ago

Thank you for this PR tada

Regarding folder structures it would make sense to me to have src/arch/sse3.rs for those particular algorithms. This has two purposes:

#[allow(unsafe)] could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.

It groups code by the specialized knowledge required to read it, understand it, review it, change it.

Regarding the code, the implement indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducability of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder.

There's also some questions regarding style in the code itself. See below.

Replied to some higher-level comments while I deal with the other ones :)

veluca93 commented 2 years ago

Thank you for this PR tada Regarding folder structures it would make sense to me to have src/arch/sse3.rs for those particular algorithms. This has two purposes:

#[allow(unsafe)] could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.

It groups code by the specialized knowledge required to read it, understand it, review it, change it.

Regarding the code, the implement indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducability of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder. There's also some questions regarding style in the code itself. See below.

Replied to some higher-level comments while I deal with the other ones :)

I added a enabled-by-default "simd" feature to allow disabling compilation of the SIMD / unsafe code. PTAL :)

veluca93 commented 2 years ago

I like the new structure better. Thanks for integrating all of the feedback, this is quite a new kind of addition to any of the libraries (png mostly relies on auto-vectorization, if you're looking for a task afterwards smirk) and thus I'd like to find a solution that we can use as a template for similar additions otherwhere.

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

HeroicKatora commented 2 years ago

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through miniz_oxide) is not simdified despite zlib-ng demonstrating how platform-specific code might show speed gains over the similarly written zlib. And then the actual color stage involve some line-by-line filtering, permutation of bytes, etc. There's always code that can be sped up by careful SIMD. The big question is whether LLVM has already done so implicitly but then it's most likely a 'generic' variant based on avx or w/e is in the default set of extensions of modern x86-64 ABI.

veluca93 commented 2 years ago

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through miniz_oxide) is not simdified despite zlib-ng demonstrating how platform-specific code might show speed gains over the similarly written zlib. And then the actual color stage involve some line-by-line filtering, permutation of bytes, etc. There's always code that can be sped up by careful SIMD. The big question is whether LLVM has already done so implicitly but then it's most likely a 'generic' variant based on avx or w/e is in the default set of extensions of modern x86-64 ABI.

In my experience, most of the gains from manual SIMDfication would be in encoding for lz77, not in decoding - decoding is simple enough that compilers do a reasonable job about it. Although I might be wrong ;) (also probably adler32)

As for other things in png: line-by-line filtering in the decoder would be rather tricky to SIMDfy (except perhaps across the channels, but then you don't gain that much - IIRC libpng does that). Permutation of bytes would benefit though. OTOH, in the encoder, quite a bit can be done :)

veluca93 commented 2 years ago

@HeroicKatora: Any clue why the CI would fail on beta and on beta only?

fintelia commented 2 years ago

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

veluca93 commented 2 years ago

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file...

This is looking like it will be a fun investigation :)

veluca93 commented 2 years ago

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file...

This is looking like it will be a fun investigation :)

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects, so I modified the code not to use those intrinsics :)

HeroicKatora commented 2 years ago

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects

Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect.

veluca93 commented 2 years ago

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects

Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect.

Yup, there is https://github.com/rust-lang/rust/issues/84042 :)

veluca93 commented 2 years ago

Hi! I was thinking of doing the aarch64/arm version, but would it be possible to merge this PR first?

HeroicKatora commented 2 years ago

Some formatting changes in another branch had caused a conflict, I fixed that formatting in a merge.

image-rs / jpeg-decoder

SSSE3 implementations of 8x8 IDCT and YCbCr conversion #211