Thanks for the PR.
Personally I'm nervous about adding more assembly. I don't understand it (even when I can follow the instructions, I don't know the pitfalls well enough to do a good code review), and neither will many of our users. One of the aspirations of the project is to have a legible stack. Assembly is more bug-prone than Go, partly because of that loss of legibility, and partly because there are lots of gotchas, especially involving memory alignment and SSE. It's also less stable than Go code, because the rules for the assembler can change (and have) over time.
Don't get me wrong -- 3x speed improvement is a lot. But do we have any programs that see a significant speed increase from these routines (and I don't just mean the floats call)? I would like to see ConstDiv64 (just call it Scale64) because it comes up in BLAS frequently. I just think we need to be judicious in our use of ASM.
Thoughts @fhs @kortschak ?
P.S. If you're looking for assembler projects, I'd love to see scaled/axpy/dot get larger vector operations on certain processors (AVX, etc.). I know that requires a few steps, but even defining what those steps are would be useful.
Thanks for taking a look!
especially involving memory alignment and SSE
All loads assume unaligned (MOVUPD and not MOVAPD), which is just as fast as aligned loads on newer platforms.
I would like to see ConstDiv64 (just call it Scale64) because it comes up in BLAS frequently.
Just note that ConstMul64 is out[x] = in[x] * c (where of course the order isn't important), while ConstDiv64 is out[x] = c / in[x], where the order obviously matters.
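In pure-Go terms, the two semantics are (a sketch; the lower-case names are just for illustration, not the exported API):

```go
// constMul64 sketches ConstMul64: multiplication commutes, so the
// operand order is irrelevant.
func constMul64(out, in []float64, c float64) {
	for i, v := range in {
		out[i] = v * c
	}
}

// constDiv64 sketches ConstDiv64: the constant is the dividend, so the
// operand order matters.
func constDiv64(out, in []float64, c float64) {
	for i, v := range in {
		out[i] = c / v
	}
}
```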
I don't know the pitfalls enough to do a good code review.
Testing is basically the same as for the pure Go code:
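For example, a test for ConstMul64 could just compare the assembler output element-wise against a plain Go loop (a sketch, assuming the out, in, c argument order used above; exact equality is fine since both sides perform the same single multiplication per element):

```go
import "testing"

func TestConstMul64(t *testing.T) {
	in := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	const c = 2.5
	got := make([]float64, len(in))
	want := make([]float64, len(in))
	ConstMul64(got, in, c)
	for i, v := range in {
		want[i] = v * c
	}
	for i := range want {
		if got[i] != want[i] {
			t.Errorf("element %d: got %v, want %v", i, got[i], want[i])
		}
	}
}
```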
I'd love to see scaled/axpy/dot get larger vector operations
Yes, they seem trivial to convert. Most likely AVX will not give a big speedup over SSE, since these operations will likely be limited by memory bandwidth.
This is code I have written for another project, so I thought that would be an obvious place to start.
I agree with @btracey. It's better to keep the amount of assembly code to a minimum. The code will need to be maintained over time.
That said, I don't think this internal package is the right place for these functions. They belong in https://github.com/gonum/floats. This repo contains things that other gonum repos use; it's not meant to provide a public API, and it's not a place to dump all assembly code.
ok, should I:
a) move it to floats and make a PR there,
b) write tests for this, or
c) drop this PR altogether?
I will look and see if the existing assembler could be made faster. I would expect that the "Unitary" versions could be improved, since they have a strict dependency chain. Which benchmarks can I best use to test them?
I would like to see the functionality of ConstMul64. It should be named Dscal, and it should have a Unitary and an Inc version (like Ddot and Daxpy). This matches the BLAS-like intent of the function.
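Roughly the following, in pure-Go reference form (the Inc parameter convention here mirrors the existing DaxpyInc, which is a guess on my part; the assembler bodies would replace these loops):

```go
// DscalUnitary scales x in place: x[i] *= alpha.
func DscalUnitary(alpha float64, x []float64) {
	for i := range x {
		x[i] *= alpha
	}
}

// DscalInc scales n elements of x in place, starting at offset ix and
// stepping by incX.
func DscalInc(alpha float64, x []float64, n, incX, ix uintptr) {
	for i := uintptr(0); i < n; i++ {
		x[ix] *= alpha
		ix += incX
	}
}
```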
In floats, we already have: Add, Sub, Mul, Min, ConstAdd, AddScaled, MinElement, MaxElement, and Sum.
I don't think we want Sqrt or Abs. They are easily accomplished with a for-loop, and they treat the slice as a collection of numbers rather than a vector.
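For concreteness, the for-loop in question is about this small:

```go
import "math"

// sqrtInPlace replaces each element with its square root; callers can
// write this themselves rather than floats providing a helper for it.
func sqrtInPlace(s []float64) {
	for i, v := range s {
		s[i] = math.Sqrt(v)
	}
}
```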
That leaves Max and Min. ConstDiv feels to me like Sqrt and Abs. Are max and min that useful? I'm not sure I've ever used that functionality.
What is the current state of this PR, and is there still interest in it? I am interested.
Recently I needed mat64.Vector, which led me to take a closer look at it from the user's point of view, and in turn to add some methods and try to improve their performance. The most useful method is Vector.AddScaledVec, i.e., axpy with a destination. I modified the current DaxpyInc assembler function to take a destination slice and wrote a benchmark against the current matrix master:
AddScaledVec10Inc1 12.6ns ± 0% 13.2ns ± 0% +4.76% (p=0.029 n=4+4)
AddScaledVec100Inc1 52.5ns ± 0% 43.4ns ± 0% -17.43% (p=0.029 n=4+4)
AddScaledVec1000Inc1 310ns ± 0% 269ns ± 0% -13.23% (p=0.029 n=4+4)
AddScaledVec10000Inc1 4.99µs ± 0% 5.11µs ± 0% +2.36% (p=0.029 n=4+4)
AddScaledVec100000Inc1 69.9µs ± 0% 70.0µs ± 0% +0.07% (p=0.029 n=4+4)
AddScaledVec10Inc2 55.6ns ± 0% 20.2ns ± 0% -63.67% (p=0.029 n=4+4)
AddScaledVec100Inc2 238ns ± 0% 114ns ± 0% -52.05% (p=0.029 n=4+4)
AddScaledVec1000Inc2 1.92µs ± 0% 1.05µs ± 0% -45.25% (p=0.029 n=4+4)
AddScaledVec10000Inc2 19.7µs ± 0% 12.5µs ± 0% -36.41% (p=0.029 n=4+4)
AddScaledVec100000Inc2 225µs ± 0% 125µs ± 0% -44.26% (p=0.029 n=4+4)
AddScaledVec10Inc20 55.6ns ± 0% 20.6ns ± 0% -62.95% (p=0.029 n=4+4)
AddScaledVec100Inc20 237ns ± 0% 114ns ± 0% -51.95% (p=0.029 n=4+4)
AddScaledVec1000Inc20 2.96µs ± 0% 2.44µs ± 0% -17.59% (p=0.029 n=4+4)
AddScaledVec10000Inc20 44.3µs ± 0% 40.6µs ± 0% -8.48% (p=0.029 n=4+4)
AddScaledVec100000Inc20 1.68ms ± 0% 1.41ms ± 0% -15.70% (p=0.029 n=4+4)
The Inc1 benchmark numbers are not relevant; I did not change DaxpyUnitary. The other numbers, on the other hand, are nice.
Therefore I would vote for having assembler routines for amd64 because they do provide a significant speedup. However, I would limit them to functions that directly correspond either to BLAS1 functions or Vector methods. floats could benefit from them as well, but not as a primary target.
Supporting BLAS and Vector means that the functions have to take increments. If the increment is 1 (a common case), the increment can be omitted, which can sometimes lead to more efficient code.
Supporting Vector also means that the functions have to take a destination slice. In total this leads to 4 combinations (with/without dest, with/without general increment). If we want to minimize assembly and simplify the (internal) API, [with dest, with general increment] is the way to go, at the price of losing some performance in the common unit-increment case. A reasonable compromise would be to have Unitary and Inc versions for each operation (my preference).
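For reference, the most general of the four combinations ([with dest, with general increment]) looks like this in pure Go; the other three are special cases of it (parameter names are placeholders, not a proposed API):

```go
// axpyIncTo computes dst[idst+i*incDst] = alpha*x[ix+i*incX] + y[iy+i*incY]
// for i in [0, n). The unitary and in-place variants fall out by fixing
// the increments to 1 and/or aliasing dst with y.
func axpyIncTo(dst []float64, incDst, idst uintptr, alpha float64,
	x, y []float64, n, incX, incY, ix, iy uintptr) {
	for i := uintptr(0); i < n; i++ {
		dst[idst] = alpha*x[ix] + y[iy]
		ix += incX
		iy += incY
		idst += incDst
	}
}
```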
These vector operations are simple and the corresponding assembler is also simple (I could even imagine generating it automatically); the rules do not seem to have changed over the years: vectorize and unroll. There is not much to be afraid of, and @klauspost's code is clean. If he is still around, I do have two questions: 1) The current DaxpyUnitary unrolls 2x, his code unrolls 4x. Is there any difference in performance? 2) If there is, is it kept when non-unitary increments are used (e.g., for Daxpy there would be three slices, three increments, and three offsets to maintain)?
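To make question 1 concrete, here is what 4x unrolling means in Go terms (the real question is about the generated SSE, but the loop structure is the same):

```go
// axpyUnitary4 is a 4x-unrolled unitary axpy: y[i] += alpha * x[i].
// A 2x-unrolled version has the same shape with a stride of 2.
func axpyUnitary4(alpha float64, x, y []float64) {
	n := len(x)
	var i int
	for ; i <= n-4; i += 4 {
		y[i] += alpha * x[i]
		y[i+1] += alpha * x[i+1]
		y[i+2] += alpha * x[i+2]
		y[i+3] += alpha * x[i+3]
	}
	for ; i < n; i++ { // tail elements
		y[i] += alpha * x[i]
	}
}
```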
@gonum/developers ?
PS: At the very least, the current DaxpyInc should be extended to take a destination slice to make it useful for Vector.
The other numbers, on the other hand, are nice.
I forgot to finish: the other numbers, on the other hand, are nice because they compare two blas64 calls (Copy and Axpy) against a single asm.DaxpyInc call (at the bottom of Vector.AddScaledVec).
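In scalar terms (unit-increment case, names illustrative), the two benchmarked paths are roughly:

```go
// Old path: blas64.Copy followed by blas64.Axpy, i.e. two passes over dst.
func addScaledOld(dst, y, x []float64, alpha float64) {
	copy(dst, y)
	for i := range dst {
		dst[i] += alpha * x[i]
	}
}

// New path: one fused kernel, a single pass over dst.
func addScaledNew(dst, y, x []float64, alpha float64) {
	for i := range dst {
		dst[i] = alpha*x[i] + y[i]
	}
}
```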
Therefore I would vote for having assembler routines for amd64 because they do provide a significant speedup. However, I would limit them to functions that directly correspond either to BLAS1 functions or Vector methods. floats could benefit from them as well, but not as a primary target.
I'm happy with this suggestion.
As an addendum to that rule, floats should use the assembly functions we get for free. That is, if a function is important enough to have an assembly version for Vector, floats should also call it.
At present, I do stand by my initial statement. (Generalized) Daxpy, Dot, and Scale are fundamental kernels in BLAS level 2 and 3 (and thus lapack, and thus matrix). The rest of them make only sparse appearances. Dnrm2 is clearly useful, but it's not any sort of kernel, even somewhere like optimize that uses it a lot. Possibly Drot, but I doubt it's on any critical path.
Because of this, these three functions are clearly useful. They provide an immediate 2-3x speed boost to "real" programs, and were we to extend the assembly to AVX it would be even more. To me this merits the cost of assembly. I'm sure assembly could speed up Dasum by a similar amount, but it's not called frequently enough to matter, even for linear algebra benchmarks.
Because the point is speed, it's worth having the extra complexity of unitary and inc versions.
Based on the above discussion, Daxpy, Ddot, and Dscal functions with corresponding tests and benchmarks were implemented in PRs #9, #10, #14, #15, #16, #18, #19, #20, #21. Therefore I will close this PR. Thanks, @klauspost
This adds assembler functions for common single and double precision operations. The assembler functions are usually more than twice as fast as the pure Go equivalents. The speedups listed in Xslice.go are measured on a slice with 10000 elements.
If you think we should move on with this, I will add tests and a license, and whatever else you think may be missing.
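For reviewers unfamiliar with the mechanism: each routine is declared in a .go file without a body, and the implementation lives in a matching _amd64.s file (a sketch; the file name and argument order here are illustrative, and a pure-Go fallback with the same signature would cover other architectures):

```go
// ConstMul64 computes out[i] = in[i] * c for every element.
// The body is implemented in constmul_amd64.s; this declaration only
// fixes the Go-visible signature.
func ConstMul64(out, in []float64, c float64)
```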