heltonmc / SIMDMath.jl

Lightweight SIMD routines for special function evaluation
MIT License
10 stars 1 fork

Potentially add vector dot product? #12

Closed heltonmc closed 1 year ago

heltonmc commented 1 year ago

Motivated a little bit by the question in Slack about computing dot products of complex vectors. I don't really want this library to be a BLAS type of thing, as I'd like to keep it lightweight and focused on the needs of Bessels, but I also understand that this library is now very well suited for this type of thing and is, I think, the only SIMD library to explicitly support complex SIMD vectors.

Anyway, it is of course exceptionally easy to beat the base linear algebra routines...

julia> x = ntuple(i -> complex(rand(), rand()), 4*50 + 3);

julia> y = ntuple(i -> complex(rand(), rand()), 4*50 + 3);

julia> xr = real.(x);

julia> yr = real.(y);

julia> @benchmark SIMDMath.dot($x, $y)
BenchmarkTools.Trial: 10000 samples with 952 evaluations.
 Range (min … max):  96.988 ns … 137.780 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     97.120 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   97.437 ns ±   1.879 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▃                             ▁▃                          ▁
  ████▇▆▅▄▅▄▄▃▄▃▃▃▅▃▃▄▃▄▄▁▃▄▁▁▃▃▁▃███▇▅▃▃▄▄▃▄▄▁▄▄▄▅▆▆▅▆▆▆▅▃▃▁▃ █
  97 ns         Histogram: log(frequency) by time       102 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LinearAlgebra.dot($x, $y)
BenchmarkTools.Trial: 10000 samples with 714 evaluations.
 Range (min … max):  176.471 ns … 236.636 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     176.646 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   177.269 ns ±   3.263 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆                ▂▄                                          ▁
  ███▆▅▅▅▅▅▅▅▆▅▆▁▁▁▆██▅▅▅▄▃▄▆▅▆█▇▇▃▃▁▄▄▁▁▄▁▁▁▁▁▃▃▃▁▃▃▄▁▁▃▁▄▃▁▃▃ █
  176 ns        Histogram: log(frequency) by time        189 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark SIMDMath.dot($xr, $yr)
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  31.607 ns … 66.230 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.732 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.837 ns ±  1.018 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁█▃                                                         
  ▃███▄▃▂▂▂▂▁▂▂▁▁▁▂▂▁▁▁▁▂▂▂▂▂▁▁▂▁▁▁▂▁▂▁▁▂▁▁▂▁▁▁▁▂▁▁▁▁▂▁▁▂▂▂▂▂ ▂
  31.6 ns         Histogram: frequency by time        34.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LinearAlgebra.dot($xr, $yr)
BenchmarkTools.Trial: 10000 samples with 792 evaluations.
 Range (min … max):  159.197 ns … 234.164 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     159.564 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   160.135 ns ±   2.923 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄█▄                    ▃▂                                    ▁
  ▃███▇▅▆▃▃▅▃▅▆▇▅▆▅▄▃▄▁▄▄███▆▅▃▄▄▄▅▄▅▆▇▆▇▇▄▅▅▃▁▃▃▃▄▁▁▄▁▄▅▅▄▄▄▅▅ █
  159 ns        Histogram: log(frequency) by time        169 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So 1.8x faster for complex dot products and 5x faster for real dot products. And this is just on my Apple machine, which only supports 2x-wide SIMD.
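The speedup here comes mostly from unrolling into independent accumulators so the compiler can vectorize and keep several fused multiply-add chains in flight. A minimal sketch of that idea for real-valued tuples (illustrative only — `dot_unrolled4` is a hypothetical name, not SIMDMath's actual kernel):

```julia
# Dot product over NTuple inputs, manually unrolled by 4 so there are four
# independent dependency chains of fused multiply-adds. A real SIMD kernel
# would load 4-wide vectors instead of scalars, but the shape is the same.
function dot_unrolled4(x::NTuple{N,T}, y::NTuple{N,T}) where {N,T<:Real}
    s1 = s2 = s3 = s4 = zero(T)
    i = 1
    # main unrolled loop
    while i + 3 <= N
        s1 = muladd(x[i],     y[i],     s1)
        s2 = muladd(x[i + 1], y[i + 1], s2)
        s3 = muladd(x[i + 2], y[i + 2], s3)
        s4 = muladd(x[i + 3], y[i + 3], s4)
        i += 4
    end
    # scalar tail for lengths not divisible by 4 (e.g. 4*50 + 3)
    while i <= N
        s1 = muladd(x[i], y[i], s1)
        i += 1
    end
    (s1 + s2) + (s3 + s4)
end
```

The four accumulators are the important part: a single running sum would serialize every `muladd` behind the previous one, while independent chains hide the FMA latency.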

It might be better to also look at Float32 to fit more numbers in...

julia> x32 = ComplexF32.(x);

julia> y32 = ComplexF32.(y);

julia> @benchmark LinearAlgebra.dot($x32, $y32)
BenchmarkTools.Trial: 10000 samples with 719 evaluations.
 Range (min … max):  176.751 ns … 254.811 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     176.925 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   177.815 ns ±   3.427 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▁             ▂▄                                       ▂    ▂
  ███▇▆▆▇▅▆▆▄▆▅▄▁▄███▅▅▅▅▆▆▇▇██▆▅▁▅▄▄▄▅▅▃▄▁▄▅▄▃▃▁▃▅▃▃▃▁▃▁▁▄█▇▆▅ █
  177 ns        Histogram: log(frequency) by time        191 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark SIMDMath.dot($x32, $y32)
BenchmarkTools.Trial: 10000 samples with 987 evaluations.
 Range (min … max):  49.856 ns … 83.249 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.983 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.158 ns ±  1.244 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▃                                ▂                       ▁
  ████▇▄▁▄▁▁▃▁▃▄▄▄▁▅▄▃▃▁▄▅▃▃▃▁▃▁▄▃▁▁▆██▅▅▄▃▁▄▄▃▁▄▄▃▁▁▄▆▅▅▅▅▆▄ █
  49.9 ns      Histogram: log(frequency) by time      54.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

# for reals
julia> xr32 = ComplexF32.(xr);

julia> yr32 = ComplexF32.(yr);

julia> @benchmark LinearAlgebra.dot($xr32, $yr32)
BenchmarkTools.Trial: 10000 samples with 714 evaluations.
 Range (min … max):  176.703 ns … 241.539 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     176.880 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   177.596 ns ±   3.281 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▂               ▁▄▁                                         ▁
  ███▆▆▄▄▄▅▃▆▅▆▅▃▄▁▅███▅▁▃▄▄▅▅▆▇▇▇█▅▄▁▃▁▄▄▄▁▃▁▁▃▃▅▅▆▆▆▅▄▅▄▄▁▄▁▄ █
  177 ns        Histogram: log(frequency) by time        189 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark SIMDMath.dot($xr32, $yr32)
BenchmarkTools.Trial: 10000 samples with 988 evaluations.
 Range (min … max):  49.848 ns … 91.051 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.975 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   50.173 ns ±  1.564 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▆▃▂                                   ▂                   ▁
  █████▅▃▃▄▁▁▁▁▁▃▄▁▁▃▃▃▃▃▁▃▅▄▄▁▁▁▃▁▁▁▁▁▃▇▇█▇▁▅▃▁▃▃▃▄▄▁▃▄▅▄▄▅▅ █
  49.8 ns      Histogram: log(frequency) by time      54.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So now ComplexF32 is about 3.5x faster. This should scale even better with AVX and AVX-512, of course, where we should see much larger speedups, but I'll have to adjust the SIMD width a little bit. I'm away from my AVX computer, so I'll have to do that later.

This also beats BLAS pretty convincingly. Of course, we are unrolling everything, which helps.
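For the complex case, each accumulation step is the conjugated product conj(a)*b, which a SIMD kernel typically expands into real arithmetic. A sketch of one such step (hypothetical helper, not SIMDMath's API):

```julia
# One multiply-accumulate step of the conjugated dot product sum(conj.(a) .* b),
# written out in real arithmetic as a SIMD kernel would do it:
#   conj(a)*b = (ar*br + ai*bi) + (ar*bi - ai*br)*im
function cdot_step(re, im, a::Complex, b::Complex)
    ar, ai = reim(a)
    br, bi = reim(b)
    re = muladd(ar, br, muladd(ai, bi, re))   # real part: ar*br + ai*bi
    im = muladd(ar, bi, muladd(-ai, br, im))  # imag part: ar*bi - ai*br
    return re, im
end
```

Keeping the real and imaginary accumulators separate like this is what lets the complex dot product map onto plain FMA lanes.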

heltonmc commented 1 year ago

Yeah, for Float16 you can really speed things up by unrolling by 8 on my laptop...

julia> @benchmark SIMDMath.dot($x16, $y16)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  27.052 ns … 59.673 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.178 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.253 ns ±  0.843 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▆▃▃▁                                                     ▂
  ███████▇▁▃▁▁▄▃▁▁▁▁▁▁▃▅▅▄▁▁▁▁▁▁▃▁▁▃▃▁▄▃▁▁▁▁▃▁▃▃▃▁▃▁▁▁▁▁▃▁▃▁▆ █
  27.1 ns      Histogram: log(frequency) by time      29.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark LinearAlgebra.dot($x16, $y16)
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  556.898 ns …  2.928 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     557.349 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   559.288 ns ± 25.496 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▅▃                               ▂▃                        ▁
  ████▇▆▄▁▁▃▃▄▃▃▄▅▄▄▃▄▄▅▅▅▄▅▁▁▁▃▃▁▁████▅▄▄▃▁▁▁▃▁▁▄▁▄▄▃▄▄▅▅▅▅▆▆ █
  557 ns        Histogram: log(frequency) by time       582 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So that reduces the compute time compared to ComplexF32 by another factor of 2. There is something odd with the ComplexF16 version in LinearAlgebra. Seems like a bad bug to me, but I'm unsure. Anyway, the algorithm needs to be modified to handle the unroll factor.

But anyway, computing the dot product of a complex vector of length 203 in 27 ns is pretty good....

codecov-commenter commented 1 year ago

Codecov Report

Merging #12 (511e393) into main (47ea095) will decrease coverage by 5.25%. The diff coverage is 0.00%.


@@            Coverage Diff             @@
##             main      #12      +/-   ##
==========================================
- Coverage   93.70%   88.46%   -5.25%     
==========================================
  Files           6        7       +1     
  Lines         270      286      +16     
==========================================
  Hits          253      253              
- Misses         17       33      +16     
Impacted Files          Coverage Δ
src/SIMDMath.jl         100.00% <ø> (ø)
src/complex.jl           98.11% <0.00%> (-1.89%) ↓
src/linearalgebra.jl      0.00% <0.00%> (ø)


heltonmc commented 1 year ago

I've added the method for arbitrary unroll length now. The new version is very fast and works well for any argument type (Float16, Float32, Float64). The thing I want to spend some time on is determining what the unroll factor should be... For example,

julia> @benchmark SIMDMath.dot($xr16, $yr16)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  7.083 ns … 30.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.208 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.222 ns ±  0.430 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              █▁                              
  ▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▂ ▂
  7.08 ns        Histogram: frequency by time        7.33 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> length(xr16)
203

This computes the dot product of real vectors of length 203 using Float16 values in 7 ns.... Similarly for Float64,

julia> @benchmark SIMDMath.dot($xr, $yr)
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
 Range (min … max):  22.884 ns … 46.687 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.009 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.090 ns ±  0.793 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁ ▆  █ ▇  ▅ ▃  ▂                                          ▂
  ▃▁█▁█▁▁█▁█▁▁█▁█▁▁█▁█▁▁▅▁▄▁▁▃▁▃▁▃▁▁▁▁▁▁▁▃▁▄▁▁▁▁▁▁▁▁▁▄▁▁▅▁▅▁▃ █
  22.9 ns      Histogram: log(frequency) by time      23.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Computes in 23 ns. This is so far the fastest of everything I've benchmarked against (StaticArrays, OpenBLAS, LoopVectorization).
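The arbitrary-unroll-length idea can be sketched by parameterizing the unroll factor as a compile-time `Val` so the compiler specializes per factor. Names here are hypothetical, not the actual implementation:

```julia
# Dot product with a compile-time unroll factor U. `Val(U)` makes U a type
# parameter, so `ntuple(..., Val(U))` fully unrolls and each U gets its own
# specialized method. (A production version would avoid the closure over the
# reassigned `i`/`acc`, e.g. via recursion or @generated, to prevent boxing.)
function dot_unrolled(x::NTuple{N,T}, y::NTuple{N,T}, ::Val{U}) where {N,T<:Real,U}
    acc = ntuple(_ -> zero(T), Val(U))   # U independent accumulators
    i = 1
    while i + U - 1 <= N
        acc = ntuple(j -> muladd(x[i + j - 1], y[i + j - 1], acc[j]), Val(U))
        i += U
    end
    s = zero(T)
    while i <= N                          # scalar tail
        s = muladd(x[i], y[i], s)
        i += 1
    end
    return s + sum(acc)
end

# e.g. unroll by 8, as for Float16 on this laptop:
# dot_unrolled(xr16, yr16, Val(8))
```

Picking U then becomes a tuning question: wide enough to hide FMA latency for the element type and vector width, but not so wide that the tail loop dominates for short inputs.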

heltonmc commented 1 year ago

The new version gives a fully general form. I'm still dubious about actually including this here, but I'll leave the issue open for a bit. It's at least a good reference for how to build on top of the library. The one advantage this approach has on my Mac is the Float16 support, where it is twice as fast as any other approach. This version produces almost identical machine code to

function dot(a, b)
    s = zero(eltype(a))
    @simd for i ∈ eachindex(a, b)
        s += conj(a[i]) * b[i]
    end
    s
end

Of course, the explicit version is clearer about what is being done and could be optimized further. Anyway, the main advantage is for Float16 and ComplexF16 values, which see very large improvements. This computation is dominated by memory operations, so it really isn't a great example of improving speeds much; no optimization will yield too large a speed increase here. It might be more interesting in a more computationally dense problem like matrix multiplication, but I'm not super interested in that case.