heltonmc closed this 1 year ago.
Yeah, for Float16 you can really speed things up by unrolling by 8. On my laptop:
julia> @benchmark SIMDMath.dot($x16, $y16)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max): 27.052 ns … 59.673 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.178 ns ┊ GC (median): 0.00%
Time (mean ± σ): 27.253 ns ± 0.843 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▇█▆▃▃▁ ▂
███████▇▁▃▁▁▄▃▁▁▁▁▁▁▃▅▅▄▁▁▁▁▁▁▃▁▁▃▃▁▄▃▁▁▁▁▃▁▃▃▃▁▃▁▁▁▁▁▃▁▃▁▆ █
27.1 ns Histogram: log(frequency) by time 29.9 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark LinearAlgebra.dot($x16, $y16)
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
Range (min … max): 556.898 ns … 2.928 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 557.349 ns ┊ GC (median): 0.00%
Time (mean ± σ): 559.288 ns ± 25.496 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▅▃ ▂▃ ▁
████▇▆▄▁▁▃▃▄▃▃▄▅▄▄▃▄▄▅▅▅▄▅▁▁▁▃▃▁▁████▅▄▄▃▁▁▁▃▁▁▄▁▄▄▃▄▄▅▅▅▅▆▆ █
557 ns Histogram: log(frequency) by time 582 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
So that reduces the compute time by another factor of 2 compared to ComplexF32. There is something odd with the ComplexF16 version in LinearAlgebra; it seems like a bad bug to me, but I'm unsure. Anyway, the algorithm needs to be modified to take the unroll factor into account.
Either way, computing the dot product of a complex vector of length 203 in 27 ns is pretty good...
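For context, "unrolling by 8" here just means keeping eight independent accumulators so the CPU can overlap the multiply-add chains instead of serializing them through one running sum. A minimal plain-Julia sketch of that idea (hypothetical dot_unroll8 helper, not the kernel in this PR):

# Illustrative sketch only, not SIMDMath's actual kernel: eight independent
# accumulators let the compiler keep eight multiply-add chains in flight.
function dot_unroll8(x::AbstractVector{T}, y::AbstractVector{T}) where T
    n = length(x)
    acc = ntuple(_ -> zero(T), 8)
    i = 1
    @inbounds while i + 7 <= n
        # update all eight accumulators each trip through the loop
        acc = ntuple(k -> acc[k] + conj(x[i + k - 1]) * y[i + k - 1], 8)
        i += 8
    end
    s = sum(acc)
    @inbounds for j in i:n        # scalar remainder loop
        s += conj(x[j]) * y[j]
    end
    return s
end

An actual SIMD kernel would carry a whole vector of lanes per accumulator rather than a scalar, but the independent-accumulator structure is the point of the unroll.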
Merging #12 (511e393) into main (47ea095) will decrease coverage by 5.25%. The diff coverage is 0.00%.
@@            Coverage Diff             @@
##             main      #12      +/-   ##
==========================================
- Coverage   93.70%   88.46%    -5.25%
==========================================
  Files           6        7        +1
  Lines         270      286       +16
==========================================
  Hits          253      253
- Misses         17       33       +16
Impacted Files | Coverage Δ |
---|---|
src/SIMDMath.jl | 100.00% <ø> (ø) |
src/complex.jl | 98.11% <0.00%> (-1.89%) :arrow_down: |
src/linearalgebra.jl | 0.00% <0.00%> (ø) |
I've added the method for an arbitrary unroll length now. The new version is very fast and works well for any argument type (Float16, Float32, Float64). The thing I want to spend some time on is determining what the unroll factor should be (a possible heuristic is sketched after the benchmarks below)... For example,
julia> @benchmark SIMDMath.dot($xr16, $yr16)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 7.083 ns … 30.041 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.208 ns ┊ GC (median): 0.00%
Time (mean ± σ): 7.222 ns ± 0.430 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁
▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▂ ▂
7.08 ns Histogram: frequency by time 7.33 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> length(xr16)
203
This computes the dot product of real vectors of length 203 using Float16 values in 7 ns.... Similarly for Float64,
julia> @benchmark SIMDMath.dot($xr, $yr)
BenchmarkTools.Trial: 10000 samples with 996 evaluations.
Range (min … max): 22.884 ns … 46.687 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 23.009 ns ┊ GC (median): 0.00%
Time (mean ± σ): 23.090 ns ± 0.793 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▆ █ ▇ ▅ ▃ ▂ ▂
▃▁█▁█▁▁█▁█▁▁█▁█▁▁█▁█▁▁▅▁▄▁▁▃▁▃▁▃▁▁▁▁▁▁▁▃▁▄▁▁▁▁▁▁▁▁▁▄▁▁▅▁▅▁▃ █
22.9 ns Histogram: log(frequency) by time 23.9 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
That computes in 23 ns. So far this is the fastest of everything I've benchmarked against (StaticArrays, OpenBLAS, LoopVectorization).
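On the question of what the unroll factor should be: one plausible heuristic (purely an assumption here, not what this PR actually does) is to derive it from the element size, so that narrower types get wider unrolls. It happens to reproduce the factor of 8 used above for Float16, and the constant would need to change on AVX/AVX-512 machines:

# Hypothetical heuristic, not the PR's actual rule: more accumulators for
# smaller element types, since more of them fit per SIMD register.
default_unroll(::Type{T}) where {T<:AbstractFloat} = max(2, 16 ÷ sizeof(T))

default_unroll(Float64)  # 2
default_unroll(Float32)  # 4
default_unroll(Float16)  # 8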
The new version gives a fully general form. I'm still dubious about actually including this here, but I'll leave it open for a bit. It's at least a good reference for how to build on top of the library. The one advantage this approach has on my Mac is the support for Float16, where it is twice as fast as any other approach. This version produces almost identical machine code to
function dot(a, b)
    s = zero(eltype(a))
    @simd for i ∈ eachindex(a, b)
        s += conj(a[i]) * b[i]
    end
    s
end
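One way to check the "almost identical machine code" claim (assuming the benchmark vectors x16 and y16 from above are still in scope) is to compare the emitted assembly directly:

using InteractiveUtils  # provides @code_native

# Inspect the assembly for both implementations; if the claim holds, the hot
# loops should show the same pattern of vector fused multiply-adds.
@code_native debuginfo=:none SIMDMath.dot(x16, y16)
@code_native debuginfo=:none dot(x16, y16)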
Of course, the explicit version is clearer about what is being done and could be optimized further. Anyway, the main advantage is for Float16 and ComplexF16 values, where it can provide very large improvements. This operation is dominated by memory traffic, so it really isn't a great showcase for improving speed; any further optimization won't yield much of an increase. It might be more interesting in a more computation-dense problem like matrix multiplication, but I'm not super interested in that case.
Motivated a little bit by the question on Slack about computing dot products of complex vectors. I don't really want this library to become a BLAS-type thing, as I'd like to keep it lightweight and focused on the needs of Bessels, but I also recognize that this library is now very well suited for these kinds of operations and is, I think, the only SIMD library to explicitly support complex SIMD vectors.
Anyway, it is of course exceptionally easy to beat the base linear algebra routines...
So it's 1.8x faster for complex dot products and 5x faster for real dot products, and this is just on my Apple machine, which only supports 2x-wide SIMD.
It might be better to also look at Float32 to fit more numbers in...
So now ComplexF32 is 3.2x faster. This should scale even better with AVX and AVX-512, of course, giving much larger speedups, but I'll have to adjust the SIMD width a little bit. I'm away from my AVX computer, so I'll have to do that later.
This also beats BLAS pretty convincingly. Of course, we are unrolling everything, which helps.