Add horizontal reduction

It's also convenient to sum a vector (horizontal add). This is in general not what you want to do at all and should be avoided. But generally this operation comes at the end of the algorithm.

LLVM offers this method explicitly which can be coded with something like..

@inline @generated function reduce_fadd(x::LVec{N, T}) where {N, T <: FloatTypes}
    ff = string("llvm.vector.reduce.fadd.v", N, T)
    s = """
    declare $(LLVMType[T]) @$ff($(LLVMType[T]), <$N x $(LLVMType[T])>)
    define $(LLVMType[T]) @entry(<$N x $(LLVMType[T])>) #0 {
    top:
        %1 = call reassoc $(LLVMType[T]) @$ff($(LLVMType[T]) -0.0, <$N x $(LLVMType[T])> %0)
        ret $(LLVMType[T]) %1
    }
    """
    return :(
        llvmcall($(s, "entry"), T, Tuple{LVec{N, T}}, x)
        )
end

The issue is that for some reason LLVM is really slow. Even with the reassoc flag it doesn't seem to implement optimal reduction.

For example, here is the reduction of a real vector of length 16 using the llvm reduction.

julia> @benchmark SIMDMath.reduce_fadd3($a.data)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.666 ns … 13.833 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.750 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.767 ns ±  0.235 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▃           █          ▆           ▂            ▁
  ▃▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ █
  3.67 ns      Histogram: log(frequency) by time     3.88 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

And here is the reduction of a COMPLEX vector of length 16 using this method.

julia> @benchmark SIMDMath.reduce_fadd($a)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.708 ns … 28.041 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.833 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.834 ns ±  0.371 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▇          █           ▂          ▁ ▁
  ▃▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ █
  2.71 ns      Histogram: log(frequency) by time     2.92 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It doesn't seem like much of course but it's also doing double the amount of work.

Anyway, I've gone with two different implementations. An optimal approach if the length of the vector is a power of two and then a fallback approach to just summing them up if not a power of two. This I think is ok because your vectors should generally be as wide as the register width so we will usually always hit the optimimal path. Of course, this method assumes that the order can be summed in any way. This is the general assumption we make with the whole library.

Going to update with new name and add the real version

Impacted Files	Coverage Δ
src/complex.jl	`100.00% <100.00%> (ø)`
src/interface.jl	`100.00% <100.00%> (ø)`

heltonmc / SIMDMath.jl

Add horizontal reduction #11

Codecov Report