heltonmc / SIMDMath.jl

Lightweight SIMD routines for special function evaluation
MIT License
10 stars 1 forks source link

Add horizontal reduction #11

Closed heltonmc closed 1 year ago

heltonmc commented 1 year ago

It's also convenient to sum a vector (horizontal add). This is in general not what you want to do at all and should be avoided. But generally this operation comes at the end of the algorithm.

LLVM offers this method explicitly which can be coded with something like..

@inline @generated function reduce_fadd(x::LVec{N, T}) where {N, T <: FloatTypes}
    ff = string("llvm.vector.reduce.fadd.v", N, T)
    s = """
    declare $(LLVMType[T]) @$ff($(LLVMType[T]), <$N x $(LLVMType[T])>)
    define $(LLVMType[T]) @entry(<$N x $(LLVMType[T])>) #0 {
    top:
        %1 = call reassoc $(LLVMType[T]) @$ff($(LLVMType[T]) -0.0, <$N x $(LLVMType[T])> %0)
        ret $(LLVMType[T]) %1
    }
    """
    return :(
        llvmcall($(s, "entry"), T, Tuple{LVec{N, T}}, x)
        )
end

The issue is that for some reason LLVM is really slow. Even with the reassoc flag it doesn't seem to implement optimal reduction.

For example, here is the reduction of a real vector of length 16 using the llvm reduction.

julia> @benchmark SIMDMath.reduce_fadd3($a.data)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.666 ns … 13.833 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.750 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.767 ns ±  0.235 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▃           █          ▆           ▂            ▁
  ▃▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ █
  3.67 ns      Histogram: log(frequency) by time     3.88 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And here is the reduction of a COMPLEX vector of length 16 using this method.

julia> @benchmark SIMDMath.reduce_fadd($a)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.708 ns … 28.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.833 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.834 ns ±  0.371 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▇          █           ▂          ▁ ▁
  ▃▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ █
  2.71 ns      Histogram: log(frequency) by time     2.92 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It doesn't seem like much of course but it's also doing double the amount of work.

Anyway, I've gone with two different implementations. An optimal approach if the length of the vector is a power of two and then a fallback approach to just summing them up if not a power of two. This I think is ok because your vectors should generally be as wide as the register width so we will usually always hit the optimimal path. Of course, this method assumes that the order can be summed in any way. This is the general assumption we make with the whole library.

Going to update with new name and add the real version

codecov-commenter commented 1 year ago

Codecov Report

Merging #11 (22c996c) into main (aa39de5) will increase coverage by 0.84%. The diff coverage is 100.00%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
+ Coverage   92.85%   93.70%   +0.84%     
==========================================
  Files           6        6              
  Lines         238      270      +32     
==========================================
+ Hits          221      253      +32     
  Misses         17       17              
Impacted Files Coverage Δ
src/complex.jl 100.00% <100.00%> (ø)
src/interface.jl 100.00% <100.00%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more