anforowicz closed this 5 months ago
I am surprised the crate isn't doing this already! Turning the chunks into a lookup table seems like a no-brainer.
I'll take a look at the bounds checks.
What machine did you run those benchmarks on? I am getting dramatically higher numbers on a Zen 4 CPU.
Just so you know, with `-C target-cpu=znver4` your `expand_8bit_into_rgb8` from this PR gets autovectorized into AVX-512 instructions: https://godbolt.org/z/3bTG8oYz3
And it absolutely tanks performance on the micro-benchmarks, with some of them taking more than twice as long. It also regresses the end-to-end decoding benchmark for the zune-paletted file by 23%.
I suspect the expansion function ends up getting called a lot on very small chunks, so the vector code is wasted and only gets in the way. I'll try to confirm this hunch.
Hmm, that function seems to be called for every row. I am surprised that autovectorization actually hurts it even on rather large images.
Zen4 has a somewhat weird AVX-512 implementation, so I wonder if LLVM just isn't realizing that the autovectorized version will be slower.
As another data point, I tested on a Zen3 CPU and don't see any huge performance differences compiling with `-C target-cpu=znver3`.
I couldn't convince the compiler to remove the bounds checks from the loop. But the CPU seems to speculate right past them without too much trouble.
A naive iterator version that removes bounds checks but copies 3 bytes instead of 4 is 25% slower than this code.
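For reference, the two shapes under comparison look roughly like this (a hypothetical sketch, not the exact code from this PR; names and signatures are assumptions):

```rust
// Indexed version: still bounds-checked, but copies 4 bytes per pixel, which
// lowers to a single 32-bit load/store. The 4th byte of each copy is scratch
// that the next pixel overwrites, so the caller must leave one byte of slack
// at the end of `output`.
fn expand_indexed(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    for (i, &idx) in input.iter().enumerate() {
        output[3 * i..3 * i + 4].copy_from_slice(&palette[idx as usize]);
    }
}

// Iterator version: no bounds checks, but copies exactly 3 bytes per pixel;
// this is the shape that benchmarked ~25% slower.
fn expand_iter(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    for (chunk, &idx) in output.chunks_exact_mut(3).zip(input) {
        chunk.copy_from_slice(&palette[idx as usize][..3]);
    }
}
```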
What machine did you run those benchmarks on?
AMD EPYC 7B12.
```
$ lscpu
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           48 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  128
On-line CPU(s) list:     0-127
Vendor ID:               AuthenticAMD
Model name:              AMD EPYC 7B12
CPU family:              23
Model:                   49
Thread(s) per core:      2
Core(s) per socket:      32
Socket(s):               2
Stepping:                0
BogoMIPS:                4499.99
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    32 MiB (64 instances)
  L3:                    256 MiB (16 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31,64-95
  NUMA node1 CPU(s):     32-63,96-127
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:  Vulnerable: Safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```
FWIW, I also didn't use `-C target-cpu=native` or anything like that, because AFAIK Chromium builds and ships a single binary for a given target platform/architecture. AFAIU this in general still allows some (lowest-common-denominator) auto-vectorization (e.g. see https://github.com/image-rs/image-png/pull/414#issuecomment-1734635158).
Ah, that runs at half the frequency of Zen4 desktop parts, so those numbers make sense. I did not use `-C target-cpu=native` either.
I suppose we could use multiversioning if wider SIMD helped, but at least in palette expansion it doesn't.
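If we ever did want that, the third-party `multiversion` crate is one way to do it (a sketch only, not something this PR does):

```rust
use multiversion::multiversion;

// Sketch: the `multiversion` crate compiles one clone of the function per
// listed target and dispatches at runtime based on detected CPU features,
// while the shipped binary stays compatible with the baseline target.
#[multiversion(targets("x86_64+avx2", "x86_64+sse4.1"))]
fn expand_row(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    for (chunk, &idx) in output.chunks_exact_mut(3).zip(input) {
        chunk.copy_from_slice(&palette[idx as usize][..3]);
    }
}
```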
@fintelia, can you PTAL?
/cc @Shnatsel (thanks again for pointing out my earlier benchmarking mistake; I read your article on eliminating bounds checks, but wasn’t able to successfully apply it - more details below)
An overview of the changes:

- New microbenchmarks under `tests/benches`, exercising a paletted test image (i.e. one for which `identify -verbose` reports `png:IHDR.color_type: 3 (Indexed)`.)
- Experiments with `std::simd` (see below for more details).
- New tests (kept in the `#[cfg(test)]` parts of the product code itself).
- Moving the palette-expansion code into `transform.rs` and making the functions private to that module.

SIMD
My attempt to SIMD-ify `create_rgba_palette` can be found in https://github.com/image-rs/image-png/commit/c71f944033f367cf00f4ca145e473585da81fc57. It seems to actually make execution slower by ~5%. A short link to the non-SIMD-ified code can be found here: https://godbolt.org/z/KeKrfWcs5. There is no auto-vectorization AFAICT.
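To make the discussion concrete, a scalar `create_rgba_palette` can look roughly like this (an illustrative sketch, not the exact code from this PR):

```rust
/// Illustrative sketch: build a memoized 256-entry RGBA lookup table (~1kB)
/// by combining the 3-byte PLTE entries with the optional tRNS alpha values.
/// Indices past the end of the PLTE data keep an opaque black placeholder.
fn create_rgba_palette(plte: &[u8], trns: Option<&[u8]>) -> [[u8; 4]; 256] {
    let mut rgba = [[0, 0, 0, 0xFF]; 256];
    for (entry, rgb) in rgba.iter_mut().zip(plte.chunks_exact(3)) {
        entry[..3].copy_from_slice(rgb);
    }
    if let Some(trns) = trns {
        for (entry, &alpha) in rgba.iter_mut().zip(trns) {
            entry[3] = alpha;
        }
    }
    rgba
}
```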
My attempt to SIMD-ify `expand_8bit_into_rgb8` can be found in https://github.com/image-rs/image-png/commit/63d45953fc913845d042ebd416b6c8071e6ba29c. It seems to actually make execution slower by ~60%. Using an unsafe `gather_select_unchecked` doesn't make a difference AFAICT. The non-SIMD-ified code can be found here: https://godbolt.org/z/T9zv891dc. There is no auto-vectorization AFAICT, which makes me a bit surprised that it is faster than the SIMD-ified version. The SIMD-ified version can be found here: https://godbolt.org/z/P19TPP33G. AFAICT it does in fact compile into SIMD instructions.
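For reference, a gather-based `std::simd` expansion can look roughly like this (nightly-only; a hypothetical sketch rather than the exact code from the commit above):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// Gather 16 RGBA entries per iteration from the memoized palette viewed as
// packed u32 values, then emit 3 of every 4 bytes. Note that the 4-to-3
// compaction on the store side stays scalar here, which hints at why a
// gather-based version can lose to plain scalar code.
fn expand_simd(indices: &[u8], palette: &[u32; 256], output: &mut [u8]) {
    for (chunk, out) in indices.chunks_exact(16).zip(output.chunks_exact_mut(48)) {
        let idx: Simd<usize, 16> =
            Simd::from_array(std::array::from_fn(|i| chunk[i] as usize));
        let rgba = Simd::gather_or_default(palette, idx);
        for (dst, px) in out.chunks_exact_mut(3).zip(rgba.to_array()) {
            dst.copy_from_slice(&px.to_le_bytes()[..3]);
        }
    }
    // (Handling of lengths not divisible by 16 omitted for brevity.)
}
```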
There are some inefficiencies that (I think) may be addressed by (somehow) hoisting the bounds checks outside of the loop (both in the non-SIMD-ified and the SIMD-ified version): the generated code contains calls to `core::slice::index::slice_end_index_len_fail` in the SIMD-ified version or to `core::panicking::panic_bounds_check` in the non-SIMD-ified version. I don't know how to address these inefficiencies / how to do the hoisting. (I've tried some `assert!`s about the relative `output.len()` and `input.len()`, but it didn't help.)
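Concretely, the kind of hint I tried looks roughly like this (an illustrative sketch with an assumed signature; as noted above, it did not remove the per-iteration checks):

```rust
// Assert the length relationship up front, hoping LLVM can then prove the
// per-iteration slice indexing in-bounds and drop the checks. (Per the text
// above, this did not eliminate the checks in practice.)
fn expand_8bit_into_rgb8(input: &[u8], output: &mut [u8], palette: &[[u8; 4]; 256]) {
    assert!(output.len() >= input.len() * 3 + 1);
    for (i, &idx) in input.iter().enumerate() {
        output[3 * i..3 * i + 4].copy_from_slice(&palette[idx as usize]);
    }
}
```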
Memoization

Break-even point
Memoization is a trade-off:

- It adds a fixed, upfront cost of X ns to the `expand_paletted_...` functions.
- It reduces the variable cost of the `expand_paletted_...` functions - they take Y ns less to process each byte.

If we process B bytes, then memoization saves B*Y ns in exchange for the extra X ns, so it helps whenever B*Y > X. The breakeven point can be calculated with B = X/Y.
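To make the formula concrete with purely made-up numbers (not the measurements below):

```rust
// Hypothetical values, for illustration only: if building the memoized
// palette costs X = 2000 ns and the memoized loop saves Y = 0.5 ns per byte,
// then memoization pays off for inputs larger than B = X / Y = 4000 bytes.
fn main() {
    let x_ns = 2000.0_f64; // hypothetical fixed cost
    let y_ns_per_byte = 0.5_f64; // hypothetical per-byte saving
    println!("break-even at {} bytes", x_ns / y_ns_per_byte);
}
```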
Note that X and Y are typically measured via microbenchmarks, which are somewhat unrealistic (i.e. the benchmarks don't compete with other code for the L1 cache). Therefore the breakeven point calculated this way may be a bit too optimistic - for example, it won't take into account the negative impact of the additional ~1kB of cache/memory pressure from the memoized `rgba_palette`.

Measurements
Without memoization we get:
With memoization we get:
Results
The measurements above give the following thresholds:
Such small images seem relatively rare. IMHO it is unjustified (in terms of extra code complexity, binary size, etc.) to implement a fallback to the non-memoized implementation for these kinds of images.