anforowicz closed this 5 months ago
I am surprised the crate isn't doing this already! Turning the chunks into a lookup table seems like a no-brainer.
I'll take a look at the bounds checks.
What machine did you run those benchmarks on? I am getting dramatically higher numbers on a Zen 4 CPU.
Just so you know, with `-C target-cpu=znver4` your `expand_8bit_into_rgb8` from this PR gets autovectorized into AVX-512 instructions: https://godbolt.org/z/3bTG8oYz3
And it absolutely tanks performance on the micro-benchmarks, with some of them taking more than twice as long. It also regresses the end-to-end decoding benchmark for the zune-paletted file by 23%.
I suspect the expansion function ends up getting called a lot on very small chunks, so the vector code is wasted and only gets in the way. I'll try to confirm this hunch.
Hmm, that function seems to be called for every row. I am surprised that autovectorization actually hurts it even on rather large images.
Zen4 has a somewhat weird AVX-512 implementation, so I wonder if LLVM just isn't realizing that the autovectorized version will be slower.
As another data point, I tested on a Zen3 CPU and don't see any huge performance differences compiling with `-C target-cpu=znver3`.
I couldn't convince the compiler to remove the bounds checks from the loop. But the CPU seems to speculate right past them without too much trouble.
A naive iterator version that removes bounds checks but copies 3 bytes instead of 4 is 25% slower than this code.
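For reference, the two shapes under comparison look roughly like this (a hypothetical sketch, not the exact code from this PR; names and signatures are assumptions):

```rust
// Indexed version: still bounds-checked, but copies 4 bytes per pixel, which
// lowers to a single 32-bit load/store. The 4th byte of each copy is scratch
// that the next pixel overwrites, so the caller must leave one byte of slack
// at the end of `output`.
fn expand_indexed(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    for (i, &idx) in input.iter().enumerate() {
        output[3 * i..3 * i + 4].copy_from_slice(&palette[idx as usize]);
    }
}

// Iterator version: no bounds checks, but copies exactly 3 bytes per pixel;
// this is the shape that benchmarked ~25% slower.
fn expand_iter(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    for (chunk, &idx) in output.chunks_exact_mut(3).zip(input) {
        chunk.copy_from_slice(&palette[idx as usize][..3]);
    }
}
```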
What machine did you run those benchmarks on?
AMD EPYC 7B12.
```
$ lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           48 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  128
On-line CPU(s) list:     0-127
Vendor ID:               AuthenticAMD
Model name:              AMD EPYC 7B12
CPU family:              23
Model:                   49
Thread(s) per core:      2
Core(s) per socket:      32
Socket(s):               2
Stepping:                0
BogoMIPS:                4499.99
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    256 MiB (16 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31,64-95
  NUMA node1 CPU(s):     32-63,96-127
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:  Vulnerable: Safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```
FWIW, I also didn't use `-C target-cpu=native` or anything like that, because AFAIK Chromium builds and ships a single binary for a given target platform/architecture. AFAIU this in general still allows some (lowest-common-denominator) auto-vectorization (e.g. see https://github.com/image-rs/image-png/pull/414#issuecomment-1734635158).
Ah, that runs at half the frequency of Zen4 desktop parts, so those numbers make sense. I did not use `-C target-cpu=native` either.
I suppose we could use multiversioning if wider SIMD helped, but at least in palette expansion it doesn't.
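If we ever did want that, the third-party `multiversion` crate is one way to do it (a sketch only, not something this PR does):

```rust
use multiversion::multiversion;

// Sketch: the `multiversion` crate compiles one clone of the function per
// listed target and dispatches at runtime based on detected CPU features,
// while the shipped binary stays compatible with the baseline target.
#[multiversion(targets("x86_64+avx2", "x86_64+sse4.1"))]
fn expand_row(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    for (chunk, &idx) in output.chunks_exact_mut(3).zip(input) {
        chunk.copy_from_slice(&palette[idx as usize][..3]);
    }
}
```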
@fintelia, can you PTAL?
/cc @Shnatsel (thanks again for pointing out my earlier benchmarking mistake; I read your article on eliminating bounds checks, but wasn’t able to successfully apply it - more details below)
An overview of the changes:

- New microbenchmarks under `tests/benches`, exercising a paletted test image (i.e. one for which `identify -verbose` reports `png:IHDR.color_type: 3 (Indexed)`.)
- Experiments with `std::simd` (see below for more details).
- New tests (kept in the `#[cfg(test)]` parts of the product code itself).
- Moving the palette-expansion code into `transform.rs` and making the functions private to that module.

SIMD
My attempt to SIMD-ify `create_rgba_palette` can be found in https://github.com/image-rs/image-png/commit/c71f944033f367cf00f4ca145e473585da81fc57. It seems to actually make execution slower by ~5%. A short link to the non-SIMD-ified code can be found here: https://godbolt.org/z/KeKrfWcs5. There is no auto-vectorization AFAICT.
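To make the discussion concrete, a scalar `create_rgba_palette` can look roughly like this (an illustrative sketch, not the exact code from this PR):

```rust
/// Illustrative sketch: build a memoized 256-entry RGBA lookup table (~1kB)
/// by combining the 3-byte PLTE entries with the optional tRNS alpha values.
/// Indices past the end of the PLTE data keep an opaque black placeholder.
fn create_rgba_palette(plte: &[u8], trns: Option<&[u8]>) -> [[u8; 4]; 256] {
    let mut rgba = [[0, 0, 0, 0xFF]; 256];
    for (entry, rgb) in rgba.iter_mut().zip(plte.chunks_exact(3)) {
        entry[..3].copy_from_slice(rgb);
    }
    if let Some(trns) = trns {
        for (entry, &alpha) in rgba.iter_mut().zip(trns) {
            entry[3] = alpha;
        }
    }
    rgba
}
```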
My attempt to SIMD-ify `expand_8bit_into_rgb8` can be found in https://github.com/image-rs/image-png/commit/63d45953fc913845d042ebd416b6c8071e6ba29c. It seems to actually make execution slower by ~60%. Using an unsafe `gather_select_unchecked` doesn't make a difference AFAICT. The non-SIMD-ified code can be found here: https://godbolt.org/z/T9zv891dc. There is no auto-vectorization AFAICT, which makes me a bit surprised that it is faster than the SIMD-ified version. The SIMD-ified version can be found here: https://godbolt.org/z/P19TPP33G. AFAICT it does in fact compile into SIMD instructions.
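For reference, a gather-based `std::simd` expansion can look roughly like this (nightly-only; a hypothetical sketch rather than the exact code from the commit above):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// Gather 16 RGBA entries per iteration from the memoized palette viewed as
// packed u32 values, then emit 3 of every 4 bytes. Note that the 4-to-3
// compaction on the store side stays scalar here, which hints at why a
// gather-based version can lose to plain scalar code.
fn expand_simd(indices: &[u8], palette: &[u32; 256], output: &mut [u8]) {
    for (chunk, out) in indices.chunks_exact(16).zip(output.chunks_exact_mut(48)) {
        let idx: Simd<usize, 16> =
            Simd::from_array(std::array::from_fn(|i| chunk[i] as usize));
        let rgba = Simd::gather_or_default(palette, idx);
        for (dst, px) in out.chunks_exact_mut(3).zip(rgba.to_array()) {
            dst.copy_from_slice(&px.to_le_bytes()[..3]);
        }
    }
    // (Handling of lengths not divisible by 16 omitted for brevity.)
}
```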
There are some inefficiencies that (I think) may be addressed by (somehow) hoisting the bounds checks outside of the loop (both in the non-SIMD-ified and the SIMD-ified version): the generated code contains calls to `core::slice::index::slice_end_index_len_fail` in the SIMD-ified version or to `core::panicking::panic_bounds_check` in the non-SIMD-ified version. I don't know how to address these inefficiencies / how to do the hoisting. (I've tried some `assert!`s about the relative `output.len()` and `input.len()`, but it didn't help.)
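Concretely, the kind of hint I tried looks roughly like this (an illustrative sketch with an assumed signature; as noted above, it did not remove the per-iteration checks):

```rust
// Assert the length relationship up front, hoping LLVM can then prove the
// per-iteration slice indexing in-bounds and drop the checks. (Per the text
// above, this did not eliminate the checks in practice.)
fn expand_8bit_into_rgb8(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    assert!(output.len() >= input.len() * 3 + 1);
    for (i, &idx) in input.iter().enumerate() {
        output[3 * i..3 * i + 4].copy_from_slice(&palette[idx as usize]);
    }
}
```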
Memoization

Break-even point
Memoization is a trade-off:

- It adds a fixed, upfront cost of X ns to the `expand_paletted_...` functions.
- It reduces the variable cost of the `expand_paletted_...` functions - they take Y ns less to process each byte.

If we process B bytes, then memoization saves B*Y ns in exchange for the extra X ns, so it helps whenever B*Y > X. The breakeven point can be calculated with B = X/Y.
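To make the formula concrete with purely made-up numbers (not the measurements below):

```rust
// Hypothetical values, for illustration only: if building the memoized
// palette costs X = 2000 ns and the memoized loop saves Y = 0.5 ns per byte,
// then memoization pays off for inputs larger than B = X / Y = 4000 bytes.
fn main() {
    let x_ns = 2000.0_f64; // hypothetical fixed cost
    let y_ns_per_byte = 0.5_f64; // hypothetical per-byte saving
    println!("break-even at {} bytes", x_ns / y_ns_per_byte);
}
```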
Note that X and Y are typically measured via microbenchmarks, which are somewhat unrealistic (i.e. the benchmarks don't compete with other code for the L1 cache). Therefore the breakeven point calculated this way may be a bit too optimistic - for example, it won't take into account the negative impact of the additional ~1kB of cache/memory pressure from the memoized `rgba_palette`.

Measurements
Without memoization we get:
With memoization we get:
Results
The measurements above give the following thresholds:
Such small images seem relatively rare. IMHO it is unjustified (in terms of extra code complexity, binary size, etc.) to implement a fallback to the non-memoized implementation for these kinds of images.