guiburon opened 9 months ago
yeah, that looks reasonable enough to me. im surprised about the results though. could you share your benchmark setup?
Thanks for your input!
I am multiplying dense f64 rectangular matrices of size (20,000 x 8,000) and (8,000 x 4,000), and I preallocate the result matrix before the benchmark. The benchmark macro @btime discards the compilation run (Julia is JIT-compiled) and then reports the minimum execution time and the allocations over multiple samples.
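For reference, a minimal sketch of how the full timing distribution (not just the minimum) can be inspected with the standard @benchmark macro from BenchmarkTools; the sizes are the ones above:

using LinearAlgebra
using BenchmarkTools

a = rand(20_000, 8_000)
b = rand(8_000, 4_000)
c = Matrix{Float64}(undef, 20_000, 4_000)

# @benchmark collects many samples and reports min/median/mean/max,
# whereas @btime prints only the minimum time and the allocations.
@benchmark mul!($c, $a, $b)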
Regarding the results I talked about: sorry, I missed that OpenBLAS ignored my thread count target. I reran the benchmark on all of my available hardware threads to circumvent that.
EDIT: And MKL ran on 6 threads, so nothing was comparable. Sorry, new laptop, something is odd with my current setup.
I will do more rigorous benchmarking later in the week, on Intel hardware as well.
Hardware and software
12-thread run
❯ julia -t 12 --project=. benchmark.jl
--- OpenBLAS ---
3.862 s (0 allocations: 0 bytes)
--- MKL ---
4.638 s (0 allocations: 0 bytes)
--- faer ---
4.396 s (0 allocations: 0 bytes)
MKL chooses to run on 6 threads according to htop, while OpenBLAS and faer use hyperthreading and run on 12 threads.
benchmark.jl
nthreads = Base.Threads.nthreads()
ENV["OPENBLAS_NUM_THREADS"] = nthreads # does not work: OpenBLAS seems to read this before the script runs; BLAS.set_num_threads is the usual alternative
ENV["OMP_NUM_THREADS"] = nthreads # likely only effective if exported before launching Julia
ENV["MKL_NUM_THREADS"] = nthreads
using LinearAlgebra
using BenchmarkTools
include("wrapper.jl")
using .faer
ma = 20_000
na = 8_000
mb = na
nb = 4_000
a = rand(Float64, ma, na)
b = rand(Float64, mb, nb)
# --- OpenBLAS ---
println("--- OpenBLAS ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mul!($c, $a, $b)
# --- MKL ---
using MKL
println("--- MKL ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mul!($c, $a, $b)
# --- faer ---
println("--- faer ---")
c = Matrix{Float64}(undef, ma, nb)
@btime mult!($c, $a, $b; nthreads=$nthreads)
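As a sketch, the thread count of the active BLAS backend could also be set and verified from inside Julia with the standard LinearAlgebra API (though, as noted later in this thread, the setting is worth double-checking against htop):

using LinearAlgebra

# Request a thread count from the loaded BLAS (OpenBLAS by default,
# MKL after `using MKL`), then read it back to verify it stuck.
BLAS.set_num_threads(Base.Threads.nthreads())
@show BLAS.get_num_threads()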
one thing that could make a difference is building faer with the nightly feature (and a nightly toolchain). this enables avx512 instructions that are currently unstable, whereas openblas uses them by default
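A sketch of what that build might look like, assuming the Cargo feature is literally named nightly (the exact invocation is an assumption, not confirmed here):

❯ cargo +nightly build --release --features nightly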
I switched to my desktop (AMD Ryzen 9 7950X3D 16C/32T, 64GB DDR5-6000) because I think my laptop may thermal throttle and artificially lower the MKL and faer results. I also set the random seed for repeatability.
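A minimal sketch of the seeding, using the standard Random stdlib (the seed value itself is arbitrary):

using Random

Random.seed!(1234) # any fixed seed makes the rand(...) calls reproducible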
nightly is worse right now, unfortunately.
**(20,000 x 8,000) x (8,000 x 4,000)**
rustc 1.78.0-nightly
❯ julia -t 32 --project=. benchmark.jl
--- OpenBLAS ---
1.166 s (0 allocations: 0 bytes)
--- MKL ---
1.369 s (0 allocations: 0 bytes)
--- faer ---
1.277 s (0 allocations: 0 bytes)
rustc 1.76.0
--- OpenBLAS ---
1.181 s (0 allocations: 0 bytes)
--- MKL ---
1.361 s (0 allocations: 0 bytes)
--- faer ---
1.197 s (0 allocations: 0 bytes)
**(40,000 x 16,000) x (16,000 x 8,000)**
rustc 1.78.0-nightly
❯ julia -t 32 --project=. benchmark.jl
--- OpenBLAS ---
9.221 s (0 allocations: 0 bytes)
--- MKL ---
10.432 s (0 allocations: 0 bytes)
--- faer ---
10.384 s (0 allocations: 0 bytes)
rustc 1.76.0
--- OpenBLAS ---
9.278 s (0 allocations: 0 bytes)
--- MKL ---
10.369 s (0 allocations: 0 bytes)
--- faer ---
9.803 s (0 allocations: 0 bytes)
Are those results reasonable? They look good to me but I don't know the expected performance of faer.
I will do different benchmarks when I have more time. Are you interested? And if so where should I share them?
the results look pretty reasonable to me. it's hard to know exactly what is making faer slower without taking a closer look. especially since im not able to bench on a wide variety of computers and the optimization settings can be tuned differently for each one.
FYI I ran the same benchmark on Intel hardware. I fixed my thread count problem: everything effectively ran on 8 threads here.
faer seems a lot less competitive on this hardware. nightly is still worse.
Hardware and software
**(20,000 x 8,000) x (8,000 x 4,000)**
rustc 1.78.0-nightly
❯ julia -t 8 --project benchmark.jl
--- OpenBLAS ---
2.326 s (0 allocations: 0 bytes)
--- MKL ---
2.082 s (0 allocations: 0 bytes)
--- faer ---
5.636 s (0 allocations: 0 bytes)
rustc 1.76.0
--- OpenBLAS ---
2.277 s (0 allocations: 0 bytes)
--- MKL ---
2.032 s (0 allocations: 0 bytes)
--- faer ---
4.662 s (0 allocations: 0 bytes)
what happens if you initialize the matrix instead of using undef? do you still get the same results?
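For concreteness, a sketch of one way to initialize the result matrix in the benchmark above (zeros is just one possible choice):

# allocate and zero-fill instead of leaving the memory uninitialized
c = zeros(Float64, ma, nb)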
i just got an idea! what happens if you benchmark faer without any of the other libraries running? i vaguely remember some issues with openmp's threadpool interfering with rayon's, which caused significant slowdowns on faer's side of things
i would be curious to see those as well as single threaded results if that's alright with you
what happens if you initialize the matrix instead of using undef? do you still get the same results?
No change
i just got an idea! what happens if you benchmark faer without any of the other libraries running? i vaguely remember some issues with openmp's threadpool interfering with rayon's, which caused significant slowdowns on faer's side of things
No change
i would be curious to see those as well as single threaded results if that's alright with you
Large performance difference here! nightly does not change anything this time. I monitored the CPU usage to make sure everything was single-threaded.
I might run the same benchmark on my AMD hardware later.
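For such a run, the BLAS side can be pinned to one thread with the standard LinearAlgebra API, as a sketch (faer's thread count goes through the wrapper's Parallelism argument instead):

using LinearAlgebra

BLAS.set_num_threads(1) # pin OpenBLAS/MKL to a single thread
@show BLAS.get_num_threads() # verify, and cross-check with htop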
Hardware and software
**(20,000 x 8,000) x (8,000 x 4,000)**
rustc 1.78.0-nightly Parallelism::Rayon(1)
❯ julia -t 1 --project=. benchmark.jl
--- OpenBLAS ---
15.627 s (0 allocations: 0 bytes)
--- MKL ---
16.032 s (0 allocations: 0 bytes)
--- faer ---
37.625 s (0 allocations: 0 bytes)
rustc 1.76.0 Parallelism::Rayon(1)
--- OpenBLAS ---
15.977 s (0 allocations: 0 bytes)
--- MKL ---
17.235 s (0 allocations: 0 bytes)
--- faer ---
37.824 s (0 allocations: 0 bytes)
rustc 1.78.0-nightly Parallelism::None
❯ julia -t 1 --project=. benchmark.jl
--- OpenBLAS ---
16.049 s (0 allocations: 0 bytes)
--- MKL ---
17.687 s (0 allocations: 0 bytes)
--- faer ---
41.687 s (0 allocations: 0 bytes)
rustc 1.76.0 Parallelism::None: not run.
yeah, no idea what's happening then. if you can share your full benchmark i can see if i can reproduce the results.
yeah, no idea what's happening then. if you can share your full benchmark i can see if i can reproduce the results.
https://github.com/guiburon/faer-api
FYI something seems odd right now with BLAS.set_num_threads, so I suggest monitoring the CPU usage to be sure OpenBLAS runs on the requested thread count. You might have to export OMP_NUM_THREADS before launching Julia if BLAS.set_num_threads does not work.
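A sketch of that workaround, assuming 12 hardware threads as in the earlier runs:

❯ export OMP_NUM_THREADS=12
❯ julia -t 12 --project=. benchmark.jl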
I don't know if you are familiar with Julia. Don't hesitate to ask if you want some pointers.
i tried the benchmark and im getting close results for all 3 libraries
--- faer ---
5.089 s (0 allocations: 0 bytes)
--- OpenBLAS ---
4.978 s (0 allocations: 0 bytes)
--- MKL ---
4.887 s (0 allocations: 0 bytes)
one thing i noticed though, was that faer seems to be running slower in julia than rust for some reason? in rust the timings range from 4.2s to 4.8s on my machine (i5-11400 @ 2.60GHz with 12 threads)
I ran the benchmark single-threaded (Rayon(1)) on my Ryzen 5 7640U @ 4.9GHz and got close results for all 3 libs.
❯ julia -t 1 --project=. benchmark.jl
--- faer ---
19.374 s (0 allocations: 0 bytes)
--- OpenBLAS ---
18.557 s (0 allocations: 0 bytes)
--- MKL ---
19.655 s (0 allocations: 0 bytes)
So the only hardware where faer is far behind (both single- and multithreaded) is that Xeon Gold 6136? It does not seem to be due to Intel hardware, judging by your i5 results. Maybe it's due to WSL, but I can't easily run the benchmark directly on Windows. I will exclude that hardware from my benchmarks for now.
Hi!
I am really impressed by your colossal work on this math kernel! I am writing a Julia wrapper to benchmark faer against OpenBLAS and MKL.
So far I have only studied the dense matrix-matrix multiplication. My preliminary results show faer approximately 50% slower than OpenBLAS and 25% slower than MKL on an AMD Ryzen 5 7640U on 8 threads.
This is basically my first Rust project and I want to be fair to faer: is this a reasonable dynamic library exposing faer's in-place matrix multiplication through the C ABI?
I am not sure if opening an issue is the right way to ask, but the faer documentation is very sparse at the moment on how to import external matrices.
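For readers unfamiliar with the setup, here is a minimal sketch of what the Julia side of such a wrapper can look like, matching the mult! call used in benchmark.jl; the library name libfaer_api, the exported symbol faer_mult, and its exact signature are hypothetical illustrations of the ccall pattern, not the actual code in the linked repo:

# Hypothetical binding to a C-ABI matmul exported from a Rust cdylib.
# Assumes column-major Float64 matrices and an exported symbol
#   faer_mult(c, a, b, m, k, n, nthreads)
function mult!(c::Matrix{Float64}, a::Matrix{Float64}, b::Matrix{Float64};
               nthreads::Integer=1)
    m, k = size(a)
    n = size(b, 2)
    @assert size(b, 1) == k && size(c) == (m, n)
    ccall((:faer_mult, "libfaer_api"), Cvoid,
          (Ptr{Float64}, Ptr{Float64}, Ptr{Float64},
           Csize_t, Csize_t, Csize_t, Csize_t),
          c, a, b, m, k, n, nthreads)
    return c
end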