Thanks for the PR.
Personally I'm nervous about adding more assembly. I don't understand it (even when I can follow the instructions, I don't know the pitfalls well enough to do a good code review), and neither will many of our users. One of the aspirations of the project is to have a legible stack. Assembly is more bug-prone than Go, partly because of that loss of legibility, and partly because there are lots of gotchas, especially involving memory alignment and SSE. It's also less stable than Go code, because the rules for the assembler can change (and have) over time.
Don't get me wrong -- 3x speed improvement is a lot. But do we have any programs that see a significant speed increase from these routines (and I don't just mean the floats call)? I would like to see ConstDiv64 (just call it Scale64) because it comes up in BLAS frequently. I just think we need to be judicious in our use of ASM.
Thoughts @fhs @kortschak ?
P.S. If you're looking for assembler projects, I'd love to see scaled/axpy/dot get larger vector operations on certain processors (AVX, etc.). I know that requires a few steps, but even defining what those steps are would be useful.
Thanks for taking a look!
especially involving memory alignment and SSE
All loads assume unaligned (MOVUPD and not MOVAPD), which is just as fast as aligned loads on newer platforms.
I would like to see ConstDiv64 (just call it Scale64) because it comes up in BLAS frequently.
Just note that ConstMul64 is out[x] = in[x] * c (where of course the order isn't important), while ConstDiv64 is out[x] = c / in[x], where the order obviously matters.
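In pure-Go terms, the two semantics are (a sketch; the lower-case names are just for illustration, not the exported API):

```go
// constMul64 sketches ConstMul64: multiplication commutes, so the
// operand order is irrelevant.
func constMul64(out, in []float64, c float64) {
	for i, v := range in {
		out[i] = v * c
	}
}

// constDiv64 sketches ConstDiv64: the constant is the dividend, so the
// operand order matters.
func constDiv64(out, in []float64, c float64) {
	for i, v := range in {
		out[i] = c / v
	}
}
```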
I don't know the pitfalls enough to do a good code review.
Testing is basically the same as for the pure Go code:
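For example, a test for ConstMul64 could just compare the assembler output element-wise against a plain Go loop (a sketch, assuming the out, in, c argument order used above; exact equality is fine since both sides perform the same single multiplication per element):

```go
import "testing"

func TestConstMul64(t *testing.T) {
	in := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	const c = 2.5
	got := make([]float64, len(in))
	want := make([]float64, len(in))
	ConstMul64(got, in, c)
	for i, v := range in {
		want[i] = v * c
	}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}
```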
I'd love to see scaled/axpy/dot get larger vector operations
Yes, they seem trivial to convert. Most likely AVX will not give a big speedup over SSE, since these operations will likely be limited by memory bandwidth.
This is code I have written for another project, so I thought that would be an obvious place to start.
I agree with @btracey. It's better to keep the amount of assembly code to a minimum. The code will need to be maintained over time.
That said, I don't think this internal package is the right place for these functions. They belong in https://github.com/gonum/floats. This repo contains things that other gonum repos use; it's not meant to provide a public API, and it's not a place to dump all assembly code.
ok, should I:
a) move it to floats and make a PR there,
b) write tests for this, or
c) drop this PR altogether?
I will look and see if the existing assembler could be made faster. I would expect that the "Unitary" versions could be improved, since they have a strict dependency chain. Which benchmarks can I best use to test them?
I would like to see the functionality of ConstMul64. It should be named Dscal, and it should have a Unitary and an Inc version (like Ddot and Daxpy). This matches the BLAS-like intent of the function.
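Roughly the following, in pure-Go reference form (the Inc parameter convention here mirrors the existing DaxpyInc, which is a guess on my part; the assembler bodies would replace these loops):

```go
// DscalUnitary scales x in place: x[i] *= alpha.
func DscalUnitary(alpha float64, x []float64) {
	for i := range x {
		x[i] *= alpha
	}
}

// DscalInc scales n elements of x in place, starting at offset ix and
// stepping by incX.
func DscalInc(alpha float64, x []float64, n, incX, ix uintptr) {
	for i := uintptr(0); i < n; i++ {
		x[ix] *= alpha
		ix += incX
	}
}
```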
In floats, we already have: Add, Sub, Mul, Min, ConstAdd, AddScaled, MinElement, MaxElement, and Sum.
I don't think we want Sqrt or Abs. They are easily accomplished with a for-loop, and they treat the slice as a collection of numbers rather than a vector.
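For concreteness, the for-loop in question is about this small:

```go
import "math"

// sqrtInPlace replaces each element with its square root; callers can
// write this themselves rather than floats providing a helper for it.
func sqrtInPlace(s []float64) {
	for i, v := range s {
		s[i] = math.Sqrt(v)
	}
}
```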
That leaves Max and Min. ConstDiv feels to me like Sqrt and Abs. Are max and min that useful? I'm not sure I've ever used that functionality.
What is the current state of this PR, and is there still interest in it? I am interested.
Recently I needed mat64.Vector, which led me to take a closer look at it from the user's point of view, and in turn to add some methods and try to improve their performance. The most useful method is Vector.AddScaledVec, i.e., axpy with a destination. I modified the current DaxpyInc assembler function to take a destination slice and wrote a benchmark against the current matrix master:
AddScaledVec10Inc1 12.6ns ± 0% 13.2ns ± 0% +4.76% (p=0.029 n=4+4)
AddScaledVec100Inc1 52.5ns ± 0% 43.4ns ± 0% -17.43% (p=0.029 n=4+4)
AddScaledVec1000Inc1 310ns ± 0% 269ns ± 0% -13.23% (p=0.029 n=4+4)
AddScaledVec10000Inc1 4.99µs ± 0% 5.11µs ± 0% +2.36% (p=0.029 n=4+4)
AddScaledVec100000Inc1 69.9µs ± 0% 70.0µs ± 0% +0.07% (p=0.029 n=4+4)
AddScaledVec10Inc2 55.6ns ± 0% 20.2ns ± 0% -63.67% (p=0.029 n=4+4)
AddScaledVec100Inc2 238ns ± 0% 114ns ± 0% -52.05% (p=0.029 n=4+4)
AddScaledVec1000Inc2 1.92µs ± 0% 1.05µs ± 0% -45.25% (p=0.029 n=4+4)
AddScaledVec10000Inc2 19.7µs ± 0% 12.5µs ± 0% -36.41% (p=0.029 n=4+4)
AddScaledVec100000Inc2 225µs ± 0% 125µs ± 0% -44.26% (p=0.029 n=4+4)
AddScaledVec10Inc20 55.6ns ± 0% 20.6ns ± 0% -62.95% (p=0.029 n=4+4)
AddScaledVec100Inc20 237ns ± 0% 114ns ± 0% -51.95% (p=0.029 n=4+4)
AddScaledVec1000Inc20 2.96µs ± 0% 2.44µs ± 0% -17.59% (p=0.029 n=4+4)
AddScaledVec10000Inc20 44.3µs ± 0% 40.6µs ± 0% -8.48% (p=0.029 n=4+4)
AddScaledVec100000Inc20 1.68ms ± 0% 1.41ms ± 0% -15.70% (p=0.029 n=4+4)
The Inc1 benchmark numbers are not relevant; I did not change DaxpyUnitary. The other numbers, on the other hand, are nice.
Therefore I would vote for having assembler routines for amd64 because they do provide a significant speedup. However, I would limit them to functions that directly correspond either to BLAS1 functions or Vector methods. floats could benefit from them as well, but not as a primary target.
Supporting BLAS and Vector means that the functions have to take increments. If the increment is 1 (a common case), the increment can be omitted, which can sometimes lead to more efficient code.
Supporting Vector also means that the functions have to take a destination slice. In total this leads to 4 combinations (with/without dest, with/without general increment). If we want to minimize assembly and simplify the (internal) API, [with dest, with general increment] is the way to go, at the price of losing some performance in the common unit-increment case. A reasonable compromise would be to have Unitary and Inc versions for each operation (my preference).
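For reference, the most general of the four combinations ([with dest, with general increment]) looks like this in pure Go; the other three are special cases of it (parameter names are placeholders, not a proposed API):

```go
// axpyIncTo computes dst[idst+i*incDst] = alpha*x[ix+i*incX] + y[iy+i*incY]
// for i in [0, n). The unitary and in-place variants fall out by fixing
// the increments to 1 and/or aliasing dst with y.
func axpyIncTo(dst []float64, incDst, idst uintptr, alpha float64,
	x, y []float64, n, incX, incY, ix, iy uintptr) {
	for i := uintptr(0); i < n; i++ {
		dst[idst] = alpha*x[ix] + y[iy]
		ix += incX
		iy += incY
		idst += incDst
	}
}
```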
These vector operations are simple and the corresponding assembler is also simple (I could even imagine generating it automatically); the rules do not seem to have changed over the years: vectorize and unroll. There is not much to be afraid of, and @klauspost's code is clean. If he is still around, I do have two questions: 1) The current DaxpyUnitary unrolls 2x, his code unrolls 4x. Is there any difference in performance? 2) If there is, is it kept when non-unitary increments are used (e.g., for Daxpy there would be three slices, three increments, and three offsets to maintain)?
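To make question 1 concrete, here is what 4x unrolling means in Go terms (the real question is about the generated SSE, but the loop structure is the same):

```go
// axpyUnitary4 is a 4x-unrolled unitary axpy: y[i] += alpha * x[i].
// A 2x-unrolled version has the same shape with a stride of 2.
func axpyUnitary4(alpha float64, x, y []float64) {
	n := len(x)
	var i int
	for ; i <= n-4; i += 4 {
		y[i] += alpha * x[i]
		y[i+1] += alpha * x[i+1]
		y[i+2] += alpha * x[i+2]
		y[i+3] += alpha * x[i+3]
	}
	for ; i < n; i++ { // tail elements
		y[i] += alpha * x[i]
	}
}
```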
@gonum/developers ?
PS: At the very least, the current DaxpyInc should be extended to take a destination slice to make it useful for Vector.
The other numbers, on the other hand, are nice.
I forgot to finish: the other numbers, on the other hand, are nice because they compare two blas64 calls (Copy and Axpy) against a single asm.DaxpyInc call (at the bottom of Vector.AddScaledVec).
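In scalar terms (unit-increment case, names illustrative), the two benchmarked paths are roughly:

```go
// Old path: blas64.Copy followed by blas64.Axpy, i.e. two passes over dst.
func addScaledOld(dst, y, x []float64, alpha float64) {
	copy(dst, y)
	for i := range dst {
		dst[i] += alpha * x[i]
	}
}

// New path: one fused kernel, a single pass over dst.
func addScaledNew(dst, y, x []float64, alpha float64) {
	for i := range dst {
		dst[i] = alpha*x[i] + y[i]
	}
}
```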
Therefore I would vote for having assembler routines for amd64 because they do provide a significant speedup. However, I would limit them to functions that directly correspond either to BLAS1 functions or Vector methods. floats could benefit from them as well, but not as a primary target.
I'm happy with this suggestion.
As an addendum to that rule, floats should use the assembly functions we get for free. That is, if a function is important enough to have an assembly version for Vector, floats should also call it.
At present, I do stand by my initial statement. (Generalized) Daxpy, Dot, and Scale are fundamental kernels in BLAS level 2 and 3 (and thus lapack, and thus matrix). The rest of them make only sparse appearances. Dnrm2 is clearly useful, but it's not any sort of kernel, even somewhere like optimize that uses it a lot. Possibly Drot, but I doubt it's on any critical path.
Because of this, these three functions are clearly useful. They provide an immediate 2-3x speed boost to "real" programs, and were we to extend the assembly to AVX it would be even more. To me this merits the cost of assembly. I'm sure assembly could speed up Dasum by a similar amount, but it's not called frequently enough to matter, even for linear algebra benchmarks.
Because the point is speed, it's worth having the extra complexity of unitary and inc versions.
Based on the above discussion, Daxpy, Ddot, and Dscal functions with corresponding tests and benchmarks were implemented in PRs #9, #10, #14, #15, #16, #18, #19, #20, #21. Therefore I will close this PR. Thanks, @klauspost
This adds assembler functions for common single and double precision operations. The assembler functions are usually more than twice as fast as the pure Go equivalents. The speedups listed in Xslice.go are measured on a slice with 10000 elements.
If you think we should move on with this, I will add tests and a license, and whatever else you think may be missing.
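For reviewers unfamiliar with the mechanism: each routine is declared in a .go file without a body, and the implementation lives in a matching _amd64.s file (a sketch; the file name and argument order here are illustrative, and a pure-Go fallback with the same signature would cover other architectures):

```go
// ConstMul64 computes out[i] = in[i] * c for every element.
// The body is implemented in constmul_amd64.s; this declaration only
// fixes the Go-visible signature.
func ConstMul64(out, in []float64, c float64)
```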