Guidelines for efficient faer dynamic library

guiburon commented 9 months ago

Hi!

I am really impressed by your colossal work on this math kernel! I am writing a Julia wrapper to benchmark faer against OpenBLAS and MKL.

So far I have only looked at dense matrix-matrix multiplication. My preliminary results show faer approximately 50% slower than OpenBLAS and 25% slower than MKL on an AMD Ryzen 5 7640U with 8 threads.

This is basically my first Rust project and I want to be fair to faer: is this a reasonable dynamic library for exposing faer's in-place matrix multiplication through the C ABI?

I am not sure if opening an issue is the right way to ask but the faer documentation is very sparse at the moment on how to import external matrices.

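use faer::modules::core::mul::matmul;
use faer::{mat, Parallelism};

// In-place c = a * b.
// Safety: the caller must pass pointers that are valid for the given
// dimensions and strides, and c must not alias a or b.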
#[no_mangle]
pub unsafe extern "C" fn mult(
    c_ptr: *mut f64,
    c_nrows: u64,
    c_ncols: u64,
    c_row_stride: u64,
    c_col_stride: u64,
    a_ptr: *const f64,
    a_nrows: u64,
    a_ncols: u64,
    a_row_stride: u64,
    a_col_stride: u64,
    b_ptr: *const f64,
    b_nrows: u64,
    b_ncols: u64,
    b_row_stride: u64,
    b_col_stride: u64,
    nthreads: u32,
) {
    assert!(!c_ptr.is_null());
    assert!(!a_ptr.is_null());
    assert!(!b_ptr.is_null());

    let c = unsafe {
        mat::from_raw_parts_mut::<f64>(
            c_ptr,
            c_nrows as usize,
            c_ncols as usize,
            c_row_stride as isize,
            c_col_stride as isize,
        )
    };

    let a = unsafe {
        mat::from_raw_parts::<f64>(
            a_ptr,
            a_nrows as usize,
            a_ncols as usize,
            a_row_stride as isize,
            a_col_stride as isize,
        )
    };

    let b = unsafe {
        mat::from_raw_parts::<f64>(
            b_ptr,
            b_nrows as usize,
            b_ncols as usize,
            b_row_stride as isize,
            b_col_stride as isize,
        )
    };

    matmul(c, a, b, None, 1.0, Parallelism::Rayon(nthreads as usize));
}
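
The function above trusts the caller to pass consistent shapes. As a minimal sketch of a check that could run before the views are built, assuming (this is not in the original) that the wrapper would rather return a status than let a panic unwind across the C ABI: `check_matmul_dims` and `DimError` are hypothetical names for illustration, and no faer calls are involved.

```rust
use std::convert::TryFrom;

/// Hypothetical error type for this sketch; not part of faer.
#[derive(Debug, PartialEq, Eq)]
pub enum DimError {
    /// a.ncols differs from b.nrows.
    InnerMismatch,
    /// c is not (a.nrows x b.ncols).
    OutputMismatch,
    /// A dimension does not fit in usize on this target.
    Overflow,
}

/// Checks that c = a * b is well formed, with shapes given as (nrows, ncols).
/// Returns (m, k, n) for c (m x n), a (m x k), b (k x n).
pub fn check_matmul_dims(
    c: (u64, u64),
    a: (u64, u64),
    b: (u64, u64),
) -> Result<(usize, usize, usize), DimError> {
    let to_usize = |x: u64| usize::try_from(x).map_err(|_| DimError::Overflow);
    let (m, k) = (to_usize(a.0)?, to_usize(a.1)?);
    let (kb, n) = (to_usize(b.0)?, to_usize(b.1)?);
    if k != kb {
        return Err(DimError::InnerMismatch);
    }
    if to_usize(c.0)? != m || to_usize(c.1)? != n {
        return Err(DimError::OutputMismatch);
    }
    Ok((m, k, n))
}
```

An exported function could call this first and map an `Err` to a nonzero return code, so that a shape mistake on the Julia side shows up as an error value rather than as a crash.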

sarah-quinones commented 9 months ago

yeah, that looks reasonable enough to me. im surprised about the results though. could you share your benchmark setup?

guiburon commented 9 months ago

Thanks for your input!

I am multiplying dense f64 rectangular matrices of size (20,000 x 8,000) and (8,000 x 4,000). I preallocate the result matrix before the benchmark. The benchmark macro @btime discards the first run (Julia is JIT-compiled) and then reports the minimum execution time and the allocations over multiple runs.

Regarding the results I talked about, sorry I missed that OpenBLAS ignored my thread count target. I reran the benchmark on all of my available hardware threads to circumvent that. EDIT: And MKL ran on 6 threads, so nothing was comparable. Sorry, new laptop, something is odd with my current setup.
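
A thread-count mismatch like this can also be guarded against on the Rust side of the wrapper. The following is only a sketch under assumptions: `effective_threads` is a hypothetical helper, the convention that 0 means "use everything" is invented for illustration, and the value reported by `std::thread::available_parallelism` usually reflects hardware threads rather than physical cores.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical helper: resolves a requested thread count.
/// 0 means "use whatever the OS reports"; larger requests are clamped
/// to that value so the pool is never oversubscribed.
pub fn effective_threads(requested: u32) -> usize {
    // Fall back to a single thread if the OS cannot report a count.
    let available = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    match requested as usize {
        0 => available,
        n => n.min(available),
    }
}
```

Returning or logging this value from the library would make it possible to confirm from Julia which thread count the faer call actually used.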

I will do more rigorous and thorough benchmarking later in the week. On Intel hardware as well.

Hardware and software

AMD Ryzen 5 7640U: 6 cores/12 threads
32 GB DDR5-5600
Linux 6.7.6-arch1-1
Julia 1.10
rustc 1.76.0
faer 0.17.1

12 threads run

❯ julia -t 12 --project=. benchmark.jl
--- OpenBLAS ---
  3.862 s (0 allocations: 0 bytes)
--- MKL ---
  4.638 s (0 allocations: 0 bytes)
--- faer ---
  4.396 s (0 allocations: 0 bytes)

According to htop, MKL chooses to run on 6 threads, while OpenBLAS and faer use hyperthreading and run on 12 threads.
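
To compare timings across matrix sizes and machines, it can help to convert them to throughput. A dense (m x k) * (k x n) product takes about 2·m·k·n floating-point operations, which is 1.28e12 for the sizes used here. The helper names below are illustrative only.

```rust
/// Floating-point operations in a dense (m x k) * (k x n) product,
/// counting one multiply and one add per inner-product term.
pub fn matmul_flops(m: u64, k: u64, n: u64) -> f64 {
    2.0 * m as f64 * k as f64 * n as f64
}

/// Throughput in GFLOP/s for a product that took `seconds` to compute.
pub fn gflops(m: u64, k: u64, n: u64, seconds: f64) -> f64 {
    matmul_flops(m, k, n) / seconds / 1e9
}
```

With the timings above, `gflops(20_000, 8_000, 4_000, 3.862)` gives roughly 331 GFLOP/s for OpenBLAS and `gflops(20_000, 8_000, 4_000, 4.396)` roughly 291 GFLOP/s for faer.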

benchmark.jl

nthreads = Base.Threads.nthreads()

ENV["OPENBLAS_NUM_THREADS"] = nthreads    # does not work
ENV["OMP_NUM_THREADS"] = nthreads    # useless?
ENV["MKL_NUM_THREADS"] = nthreads

using LinearAlgebra
using BenchmarkTools

include("wrapper.jl")
using .faer

ma = 20_000
na = 8_000

mb = na
nb = 4_000

a = rand(Float64, ma, na)
b = rand(Float64, mb, nb)

# --- OpenBLAS ---
println("--- OpenBLAS ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mul!($c, $a, $b)

# --- MKL ---
using MKL
println("--- MKL ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mul!($c, $a, $b)

# --- faer ---
println("--- faer ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mult!($c, $a, $b; nthreads=$nthreads)
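
wrapper.jl is not shown in the thread, so what it passes for the stride arguments is an assumption here. Julia stores a `Matrix{Float64}` in column-major order, so a dense matrix would typically be described with a row stride of 1 and a column stride equal to the number of rows. The sketch below treats strides as element counts (not bytes); the function names are hypothetical.

```rust
/// Strides, in elements, of a dense column-major matrix with `nrows` rows:
/// moving down one row advances by 1, moving right one column by `nrows`.
pub fn col_major_strides(nrows: usize) -> (isize, isize) {
    (1, nrows as isize)
}

/// Linear offset of element (i, j), zero-based, for the given strides.
pub fn offset(i: usize, j: usize, row_stride: isize, col_stride: isize) -> isize {
    i as isize * row_stride + j as isize * col_stride
}
```

For a 3 x 2 matrix stored as `[0, 1, 2, 3, 4, 5]`, element (2, 1) sits at offset 2·1 + 1·3 = 5, the last entry of the second column.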

sarah-quinones commented 9 months ago

one thing that could make a difference is building faer with the nightly feature (and a nightly toolchain). this enables avx512 instructions that are currently unstable, whereas openblas uses them by default
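
Whether the CPU supports AVX-512 at all can be checked separately from how faer was built. This is a small sketch using the standard library's runtime feature detection; `cpu_has_avx512f` is a hypothetical name, and the result says nothing about whether a given faer build actually uses those instructions.

```rust
/// Returns true when the running CPU reports AVX-512 Foundation support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn cpu_has_avx512f() -> bool {
    is_x86_feature_detected!("avx512f")
}

/// On non-x86 targets there is no AVX-512, so this sketch returns false.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn cpu_has_avx512f() -> bool {
    false
}
```

Printing this value next to the benchmark output would record, for each machine, whether the nightly feature had anything to enable.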

guiburon commented 9 months ago

I switched to my desktop (AMD Ryzen 9 7950X3D 16C/32T, 64 GB DDR5-6000) because I think my laptop may be thermal throttling and artificially lowering the MKL and faer results. I also set the random seed for repeatability.

nightly is worse right now unfortunately.

(20,000 x 8,000) * (8,000 x 4,000)

rustc 1.78.0-nightly

❯ julia -t 32 --project=. benchmark.jl
--- OpenBLAS ---
  1.166 s (0 allocations: 0 bytes)
--- MKL ---
  1.369 s (0 allocations: 0 bytes)
--- faer ---
  1.277 s (0 allocations: 0 bytes)

rustc 1.76.0

--- OpenBLAS ---
  1.181 s (0 allocations: 0 bytes)
--- MKL ---
  1.361 s (0 allocations: 0 bytes)
--- faer ---
  1.197 s (0 allocations: 0 bytes)

(40,000 x 16,000) * (16,000 x 8,000)

rustc 1.78.0-nightly

❯ julia -t 32 --project=. benchmark.jl
--- OpenBLAS ---
  9.221 s (0 allocations: 0 bytes)
--- MKL ---
  10.432 s (0 allocations: 0 bytes)
--- faer ---
  10.384 s (0 allocations: 0 bytes)

rustc 1.76.0

--- OpenBLAS ---
  9.278 s (0 allocations: 0 bytes)
--- MKL ---
  10.369 s (0 allocations: 0 bytes)
--- faer ---
  9.803 s (0 allocations: 0 bytes)

Are those results reasonable? They look good to me but I don't know the expected performance of faer.

I will do different benchmarks when I have more time. Are you interested? And if so where should I share them?

sarah-quinones commented 9 months ago

the results look pretty reasonable to me. it's hard to know exactly what is making faer slower without taking a closer look. especially since im not able to bench on a wide variety of computers and the optimization settings can be tuned differently for each one.

guiburon commented 9 months ago

FYI I ran the same benchmark on Intel hardware. I fixed my thread count problem: everything actually ran on 8 threads here.

faer seems a lot less competitive on this hardware. nightly is still worse.

Hardware and software

Intel Xeon Gold 6136: 12 cores/12 threads (hyperthreading disabled)
64 GB DDR4-2666
Linux WSL2 5.15.133.1
Julia 1.10
faer 0.17.1

(20,000 x 8,000) * (8,000 x 4,000)

rustc 1.78.0-nightly

❯ julia -t 8 --project benchmark.jl
--- OpenBLAS ---
  2.326 s (0 allocations: 0 bytes)
--- MKL ---
  2.082 s (0 allocations: 0 bytes)
--- faer ---
  5.636 s (0 allocations: 0 bytes)

rustc 1.76.0

--- OpenBLAS ---
  2.277 s (0 allocations: 0 bytes)
--- MKL ---
  2.032 s (0 allocations: 0 bytes)
--- faer ---
  4.662 s (0 allocations: 0 bytes)

sarah-quinones commented 9 months ago

what happens if you initialize the matrix instead of using undef? do you still get the same results?

sarah-quinones commented 9 months ago

i just got an idea! what happens if you benchmark faer without any of the other libraries running? i vaguely remember some issues with openmp's threadpool interfering with rayon's, which caused significant slowdowns on faer's side of things

i would be curious to see those as well as single threaded results if that's alright with you

guiburon commented 9 months ago

> what happens if you initialize the matrix instead of using undef? do you still get the same results?

No change

> i just got an idea! what happens if you benchmark faer without any of the other libraries running? i vaguely remember some issues with openmp's threadpool interfering with rayon's, which caused significant slowdowns on faer's side of things

No change

> i would be curious to see those as well as single threaded results if that's alright with you

Large performance difference here! nightly does not change anything this time. I monitored the CPU usage to make sure everything ran on a single thread. I might run the same benchmark on my AMD hardware later.

Hardware and software

Intel Xeon Gold 6136: 12 cores/12 threads (hyperthreading disabled)
64 GB DDR4-2666
Linux WSL2 5.15.133.1
Julia 1.10
faer 0.17.1

(20,000 x 8,000) * (8,000 x 4,000)

rustc 1.78.0-nightly, Parallelism::Rayon(1)

❯ julia -t 1 --project=. benchmark.jl
--- OpenBLAS ---
  15.627 s (0 allocations: 0 bytes)
--- MKL ---
  16.032 s (0 allocations: 0 bytes)
--- faer ---
  37.625 s (0 allocations: 0 bytes)

rustc 1.76.0, Parallelism::Rayon(1)

--- OpenBLAS ---
  15.977 s (0 allocations: 0 bytes)
--- MKL ---
  17.235 s (0 allocations: 0 bytes)
--- faer ---
  37.824 s (0 allocations: 0 bytes)

rustc 1.78.0-nightly, Parallelism::None

❯ julia -t 1 --project=. benchmark.jl
--- OpenBLAS ---
  16.049 s (0 allocations: 0 bytes)
--- MKL ---
  17.687 s (0 allocations: 0 bytes)
--- faer ---
  41.687 s (0 allocations: 0 bytes)

rustc 1.76.0, Parallelism::None: not done.

sarah-quinones commented 9 months ago

yeah, no idea what's happening then. if you can share your full benchmark i can see if i can reproduce the results.

guiburon commented 9 months ago

> yeah, no idea what's happening then. if you can share your full benchmark i can see if i can reproduce the results.

https://github.com/guiburon/faer-api

FYI something seems odd right now with BLAS.set_num_threads so I suggest monitoring the CPU usage to be sure OpenBLAS runs on the requested thread count. You might have to export OMP_NUM_THREADS before launching Julia if BLAS.set_num_threads does not work.

I don't know if you are familiar with Julia. Don't hesitate to ask if you want some pointers.

sarah-quinones commented 9 months ago

i tried the benchmark and im getting close results for all 3 libraries

--- faer ---
  5.089 s (0 allocations: 0 bytes)
--- OpenBLAS ---
  4.978 s (0 allocations: 0 bytes)
--- MKL ---
  4.887 s (0 allocations: 0 bytes)

one thing i noticed though, was that faer seems to be running slower in julia than rust for some reason? in rust the timings range from 4.2s to 4.8s on my machine (i5-11400 @ 2.60GHz with 12 threads)
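
The Rust-side harness used for those timings is not shown in the thread. For a like-for-like comparison with @btime, a minimal sketch could run the workload once untimed and then keep the fastest of several timed runs. `bench_min` is a hypothetical helper that uses only the standard library; the closure would wrap the actual matmul call.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once as a warm-up, then `runs` more times,
/// and returns the fastest measured duration.
pub fn bench_min<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    assert!(runs > 0, "need at least one timed run");
    f(); // warm-up run, not timed
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}
```

Keeping the minimum rather than the mean reduces the influence of one-off disturbances such as scheduling noise, which matters when the gap between two environments is only a few percent.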

guiburon commented 9 months ago

I ran the benchmark single-threaded (Rayon(1)) on my Ryzen 5 7640U @ 4.9GHz and got close results for all 3 libs.

❯ julia -t 1 --project=. benchmark.jl
--- faer ---
  19.374 s (0 allocations: 0 bytes)
--- OpenBLAS ---
  18.557 s (0 allocations: 0 bytes)
--- MKL ---
  19.655 s (0 allocations: 0 bytes)

So the only hardware where faer is far behind (both single-threaded and multithreaded) is that Xeon Gold 6136? Judging by your i5 results, it does not seem to be an Intel-specific problem. Maybe it's due to WSL, but I can't easily run the benchmark directly on Windows. I will exclude that hardware from my benchmarks for now.

sarah-quinones / faer-rs

Guidelines for efficient faer dynamic library #108