I optimized the initial AVX-512 GEMM kernel based on what works best on my 2020 Intel MacBook Pro (i5). This is an Ice Lake client architecture system, which has a single 512-bit FMA unit. When testing on a c6i.xlarge instance in AWS (Intel Xeon, Ice Lake server), I found that doubling the Avx512Kernel::MR constant from 6 to 12 gave a substantial boost, and going to 14 is better still. Anything more is slower (we run out of zmm registers). The server CPU has two 512-bit FMA units.
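For a rough register budget, assuming the kernel computes an MR x 32 f32 tile (two zmm registers per output row; the 32-column width is my assumption, not necessarily the kernel's actual NR): MR=14 needs 14 × 2 = 28 accumulators, plus a couple of registers for the loaded B tile and one for the broadcast A element, i.e. 31 of the 32 available zmm registers. MR=15 would already force spills.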
This means that for optimal performance across a range of systems, multiple sizes of AVX-512 kernel are needed, plus some mechanism to choose between them. Section 18.1 ("Servers with a Single FMA Unit") of the Intel Optimization Manual has code to detect the FMA unit count, but it relies on a microbenchmark. Google's cpu_features library instead relies on detecting specific CPU models.
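As a sketch of what the dispatch could look like (detect_fma_units is a hypothetical helper, not something this library or cpu_features actually provides under that name):

```rust
/// Hypothetical helper: a real version would wrap either the Intel manual's
/// microbenchmark (time an FMA-heavy loop against a mixed FMA+shuffle loop)
/// or cpu_features-style matching of the family/model reported by CPUID.
fn detect_fma_units() -> usize {
    // Placeholder result; see the two detection strategies above.
    1
}

/// Illustrative only: pick a micro-kernel tile height from the FMA unit count.
fn select_mr() -> usize {
    match detect_fma_units() {
        // Single FMA unit (e.g. Ice Lake client): the smaller tile works best.
        1 => 6,
        // Two FMA units (e.g. Ice Lake server): a taller tile keeps both busy.
        _ => 14,
    }
}
```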
In addition to increasing the tile size, the output prefetching logic probably also needs adjusting to work with a larger kernel. Currently we prefetch all rows of the output tile in a single loop before the final outer product, but when MR gets large this is inefficient; the usual approach is to interleave prefetching with computation.
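A minimal sketch of the interleaved version, under my own assumptions about the loop structure (the real kernel's layout and strides differ):

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Instead of issuing all MR prefetches in one burst before the final outer
/// product, prefetch one row of C per iteration over the last MR iterations
/// of the depth loop, so the prefetches overlap with the FMA work.
unsafe fn depth_loop(c: *const f32, c_row_stride: usize, depth: usize) {
    const MR: usize = 14;
    for k in 0..depth {
        if k + MR >= depth {
            // One row of the output tile per iteration; `row` runs 0..MR.
            let row = k + MR - depth;
            _mm_prefetch::<_MM_HINT_T0>(c.add(row * c_row_stride) as *const i8);
        }
        // ... FMA outer-product step for depth index `k` goes here ...
    }
}
```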
Here are some concrete numbers from a benchmark, taking the number for M=N=K=1024:

- Baseline performance of the AVX-512 kernel (with MR=6) on c6i.xlarge: ~180 GFLOPS
- Intel MKL with 2-4 threads (using gemm-benchmark): ~308 GFLOPS
- BLIS: ~260 GFLOPS
- This library's AVX-512 kernel with MR=14: ~232 GFLOPS
- As above, plus compiling with RUSTFLAGS="-C target-cpu=native": ~239 GFLOPS
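For scale: at M=N=K=1024 a single GEMM does 2 × 1024³ ≈ 2.15 GFLOP, so ~232 GFLOPS works out to roughly 9.3 ms per multiplication, versus ~7 ms for MKL.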
Looking at a report generated by perf, the A-block packing code shows up as expensive (~14% of runtime with the default target CPU, ~12% with target-cpu=native). This is not surprising, since it logically involves reading MRxMR-sized blocks from A, transposing them and writing them to the packing buffer. I looked at this in https://github.com/robertknight/wasnn/issues/16 and didn't find an easy win with the smaller MR, but it is perhaps worth revisiting for larger MR.
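For reference, a scalar sketch of what that packing step does (argument names are mine, not the library's; the real routine, and any SIMD transpose tricks, will differ):

```rust
/// Pack an `mr`-row slice of a row-major A matrix into a K-major panel,
/// so the kernel can read one column of A with a single contiguous load
/// per FMA step.
fn pack_a_panel(out: &mut [f32], a: &[f32], a_row_stride: usize, mr: usize, depth: usize) {
    debug_assert!(out.len() >= mr * depth);
    for k in 0..depth {
        for row in 0..mr {
            // Transpose on the fly: element (row, k) -> out[k * mr + row].
            out[k * mr + row] = a[row * a_row_stride + k];
        }
    }
}
```

The strided reads this implies are where the cost comes from, and they get wider as MR grows.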
The C6i instances also support the AVX-512 VNNI extension (aka "Deep Learning Boost"). Ultimately, being able to exploit that would get the most out of them.