I optimized the initial AVX-512 GEMM kernel based on what works best on my 2020 Intel MacBook Pro (i5). This is an Ice Lake client architecture system, which has a single 512-bit FMA unit. When testing on a c6i.xlarge instance in AWS (Intel Xeon, Ice Lake server), I found that doubling the Avx512Kernel::MR constant from 6 to 12 gave a substantial boost, and going to 14 is better still. Anything more is slower (we run out of zmm registers). The server CPU has two 512-bit FMA units.
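For a rough register budget, assuming the kernel computes an MR x 32 f32 tile (two zmm registers per output row; the 32-column width is my assumption, not necessarily the kernel's actual NR): MR=14 needs 14 × 2 = 28 accumulators, plus a couple of registers for the loaded B tile and one for the broadcast A element, i.e. 31 of the 32 available zmm registers. MR=15 would already force spills.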
This means that for optimal performance across a range of systems, multiple sizes of AVX-512 kernel are needed, plus some mechanism to choose between them. Section 18.1 ("Servers with a Single FMA Unit") of the Intel Optimization Manual has code to detect the FMA unit count, but it relies on a microbenchmark. Google's cpu_features library instead relies on detecting specific CPU models.
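As a sketch of what the dispatch could look like (detect_fma_units is a hypothetical helper, not something this library or cpu_features actually provides under that name):

```rust
/// Hypothetical helper: a real version would wrap either the Intel manual's
/// microbenchmark (time an FMA-heavy loop against a mixed FMA+shuffle loop)
/// or cpu_features-style matching of the family/model reported by CPUID.
fn detect_fma_units() -> usize {
    // Placeholder result; see the two detection strategies above.
    1
}

/// Illustrative only: pick a micro-kernel tile height from the FMA unit count.
fn select_mr() -> usize {
    match detect_fma_units() {
        // Single FMA unit (e.g. Ice Lake client): the smaller tile works best.
        1 => 6,
        // Two FMA units (e.g. Ice Lake server): a taller tile keeps both busy.
        _ => 14,
    }
}
```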
In addition to increasing the tile size, the output prefetching logic probably also needs adjusting to work with a larger kernel. Currently we prefetch all rows of the output tile in a single loop before the final outer product, but when MR gets large this is inefficient; the usual approach is to interleave prefetching with computation.
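A minimal sketch of the interleaved version, under my own assumptions about the loop structure (the real kernel's layout and strides differ):

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Instead of issuing all MR prefetches in one burst before the final outer
/// product, prefetch one row of C per iteration over the last MR iterations
/// of the depth loop, so the prefetches overlap with the FMA work.
unsafe fn depth_loop(c: *const f32, c_row_stride: usize, depth: usize) {
    const MR: usize = 14;
    for k in 0..depth {
        if k + MR >= depth {
            // One row of the output tile per iteration; `row` runs 0..MR.
            let row = k + MR - depth;
            _mm_prefetch::<_MM_HINT_T0>(c.add(row * c_row_stride) as *const i8);
        }
        // ... FMA outer-product step for depth index `k` goes here ...
    }
}
```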
Here are some concrete numbers from a benchmark, taking the number for M=N=K=1024:

- Baseline performance of the AVX-512 kernel (with MR=6) on c6i.xlarge: ~180 GFLOPS
- Intel MKL with 2-4 threads (using gemm-benchmark): ~308 GFLOPS
- BLIS: ~260 GFLOPS
- This library's AVX-512 kernel with MR=14: ~232 GFLOPS
- As above, plus compiling with RUSTFLAGS="-C target-cpu=native": ~239 GFLOPS
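For scale: at M=N=K=1024 a single GEMM does 2 × 1024³ ≈ 2.15 GFLOP, so ~232 GFLOPS works out to roughly 9.3 ms per multiplication, versus ~7 ms for MKL.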
Looking at a report generated by perf, the A-block packing code shows up as expensive (~14% of runtime with the default target CPU, ~12% with target-cpu=native). This is not surprising, since it logically involves reading MRxMR-sized blocks from A, transposing them and writing them to the packing buffer. I looked at this in https://github.com/robertknight/wasnn/issues/16 and didn't find an easy win with the smaller MR, but it is perhaps worth revisiting for larger MR.
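For reference, a scalar sketch of what that packing step does (argument names are mine, not the library's; the real routine, and any SIMD transpose tricks, will differ):

```rust
/// Pack an `mr`-row slice of a row-major A matrix into a K-major panel,
/// so the kernel can read one column of A with a single contiguous load
/// per FMA step.
fn pack_a_panel(out: &mut [f32], a: &[f32], a_row_stride: usize, mr: usize, depth: usize) {
    debug_assert!(out.len() >= mr * depth);
    for k in 0..depth {
        for row in 0..mr {
            // Transpose on the fly: element (row, k) -> out[k * mr + row].
            out[k * mr + row] = a[row * a_row_stride + k];
        }
    }
}
```

The strided reads this implies are where the cost comes from, and they get wider as MR grows.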
The C6i instances also support the AVX-512 VNNI extension (aka "Deep Learning Boost"). Ultimately, being able to exploit that would get the most out of them.