gonum / blas

A BLAS implementation for Go [DEPRECATED]

goblas: optimize Ddot using SSE2 on amd64 #64

Closed fhs closed 9 years ago

fhs commented 9 years ago

Benchmark using Go 1.4 release on Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz:

benchmark                          old ns/op     new ns/op     delta
BenchmarkDdotSmallBothUnitary      18.5          14.9          -19.46%
BenchmarkDdotSmallIncUni           23.4          21.5          -8.12%
BenchmarkDdotSmallUniInc           23.4          22.2          -5.13%
BenchmarkDdotSmallBothInc          23.1          21.4          -7.36%
BenchmarkDdotMediumBothUnitary     1344          504           -62.50%
BenchmarkDdotMediumIncUni          1690          1187          -29.76%
BenchmarkDdotMediumUniInc          1686          1177          -30.19%
BenchmarkDdotMediumBothInc         1717          1210          -29.53%
BenchmarkDdotLargeBothUnitary      134869        59336         -56.00%
BenchmarkDdotLargeIncUni           223717        163716        -26.82%
BenchmarkDdotLargeUniInc           189230        141713        -25.11%
BenchmarkDdotLargeBothInc          242719        203594        -16.12%
BenchmarkDdotHugeBothUnitary       15563100      9585254       -38.41%
BenchmarkDdotHugeIncUni            34440563      31152669      -9.55%
BenchmarkDdotHugeUniInc            26460880      21385974      -19.18%
BenchmarkDdotHugeBothInc           40452938      40136997      -0.78%

On my laptop with Go 1.5-dev and Intel(R) Core(TM) i5-4330M CPU @ 2.80GHz I got these numbers (they are actually not very stable because of CPU scaling):

benchmark                          old ns/op     new ns/op     delta
BenchmarkDdotSmallBothUnitary      12.9          10.1          -21.71%
BenchmarkDdotSmallIncUni           16.5          16.1          -2.42%
BenchmarkDdotSmallUniInc           16.2          16.2          +0.00%
BenchmarkDdotSmallBothInc          16.2          16.0          -1.23%
BenchmarkDdotMediumBothUnitary     869           432           -50.29%
BenchmarkDdotMediumIncUni          1031          1021          -0.97%
BenchmarkDdotMediumUniInc          1027          1015          -1.17%
BenchmarkDdotMediumBothInc         1042          1022          -1.92%
BenchmarkDdotLargeBothUnitary      87156         44838         -48.55%
BenchmarkDdotLargeIncUni           230788        227383        -1.48%
BenchmarkDdotLargeUniInc           138736        129148        -6.91%
BenchmarkDdotLargeBothInc          335562        333651        -0.57%
BenchmarkDdotHugeBothUnitary       11050482      8906654       -19.40%
BenchmarkDdotHugeIncUni            28042141      27658908      -1.37%
BenchmarkDdotHugeUniInc            19012855      17787912      -6.44%
BenchmarkDdotHugeBothInc           34613451      34130483      -1.40%

I would be interested in what numbers other people get! Run:

go test -bench BenchmarkDdot > old.txt   # with the old code checked out
go test -bench BenchmarkDdot > new.txt   # with this change applied
go get golang.org/x/tools/cmd/benchcmp
benchcmp old.txt new.txt
kortschak commented 9 years ago

I get this with 1.4:

$ benchcmp old.txt new.txt 
benchmark                          old ns/op     new ns/op     delta
BenchmarkDdotSmallBothUnitary      23.1          21.6          -6.49%
BenchmarkDdotSmallIncUni           36.3          29.5          -18.73%
BenchmarkDdotSmallUniInc           44.9          28.0          -37.64%
BenchmarkDdotSmallBothInc          37.8          27.6          -26.98%
BenchmarkDdotMediumBothUnitary     1627          627           -61.46%
BenchmarkDdotMediumIncUni          2344          1647          -29.74%
BenchmarkDdotMediumUniInc          2478          1618          -34.71%
BenchmarkDdotMediumBothInc         2504          1945          -22.32%
BenchmarkDdotLargeBothUnitary      186890        84719         -54.67%
BenchmarkDdotLargeIncUni           713870        634351        -11.14%
BenchmarkDdotLargeUniInc           433818        317105        -26.90%
BenchmarkDdotLargeBothInc          935882        841664        -10.07%
BenchmarkDdotHugeBothUnitary       27035479      22361410      -17.29%
BenchmarkDdotHugeIncUni            72098861      65191729      -9.58%
BenchmarkDdotHugeUniInc            48003429      50984946      +6.21%
BenchmarkDdotHugeBothInc           86445132      91902762      +6.31%

I can't see why the release test didn't build on travis - textflag.h should be present.

Would you add license headers?

kortschak commented 9 years ago

... and with go version devel +a877e81.

$ benchcmp old.txt new.txt 
benchmark                          old ns/op     new ns/op     delta
BenchmarkDdotSmallBothUnitary      22.2          20.6          -7.21%
BenchmarkDdotSmallIncUni           34.9          27.4          -21.49%
BenchmarkDdotSmallUniInc           35.3          27.5          -22.10%
BenchmarkDdotSmallBothInc          35.4          27.3          -22.88%
BenchmarkDdotMediumBothUnitary     1600          620           -61.25%
BenchmarkDdotMediumIncUni          2320          1630          -29.74%
BenchmarkDdotMediumUniInc          2400          1603          -33.21%
BenchmarkDdotMediumBothInc         2296          1946          -15.24%
BenchmarkDdotLargeBothUnitary      162103        81675         -49.62%
BenchmarkDdotLargeIncUni           626682        583883        -6.83%
BenchmarkDdotLargeUniInc           335803        298583        -11.08%
BenchmarkDdotLargeBothInc          848481        825154        -2.75%
BenchmarkDdotHugeBothUnitary       23342114      21972418      -5.87%
BenchmarkDdotHugeIncUni            66768363      63730296      -4.55%
BenchmarkDdotHugeUniInc            45670037      44210559      -3.20%
BenchmarkDdotHugeBothInc           85100256      83724079      -1.62%
jonlawlor commented 9 years ago

The problem is that the travis-ci release uses Go 1.3.3, while the tip uses Go 1.4. If you look at the build details, you'll see that tip passed, but release failed. I switched my computer to 1.3.3 and tried it out and I get the same error as the travis build.

edit: in the past we've tried to keep support for older versions of Go, and 1.3.3 is very recent. Is there a way to get this working in that version?

btracey commented 9 years ago

Thanks for putting this together. A couple of high-level comments: 1) I'm not sure what we want to do about code review and maintenance. I can sort of tell what's going on, but that's largely because of the nice comments in the routine. I like fast code as much as anyone, but we need to have people who can maintain the code (for example, to support changes in the runtime).

2) We'll need to expand the test suite, specifically to include cases with strides wider than the matrix and cases where the slices shadow one another by various amounts. With assembly we've coded ourselves, we can't rely on the Go team having done the right thing.

3) Assuming it's maintainable and correct, this could be quite promising. There are a small number of loop kernels (ddot, axpy, and scale at least, but probably not too much more), and it's probably not that hard to recode the inner loops of the higher-level functions to use these. I don't know how much the lack of inline-assembly will hurt, but the go tool may improve at some point.

4) Is it possible to code these in such a way that it is extensible to more than just MULPD? The Intel Knights Landing chips, for example, are moving in the direction of larger and larger vector units. It would be great if we had code that could take advantage of that somehow. What I have in mind would be something like

type SimdSize struct {
	length         int
	mulInstruction string
	// etc.
}

We then have a code generator that takes in an array of SimdSizes and generates the assembly. To generate the amd64 code, you would give it only SimdSize{2, "MULPD"}, while for Knights Landing you would give it the instructions for 128-, 256-, and 512-bit vector sizes. The generator tool would provide the logic for calling those in succession (as well as testing for shadowing, etc.). It seems that the future is headed toward manycore, and the Go language has the constructs to take advantage of that. It would be nice if we could also take advantage of the additional vectorization. It may be that the assembly dialects are so different that such an endeavour is too much effort. Just a thought.

btracey commented 9 years ago

Here are my benchmarks:

On 1.4, old vs new

benchmark                          old ns/op     new ns/op     delta       
BenchmarkDdotSmallBothUnitary      13.2          12.4          -6.06%      
BenchmarkDdotSmallIncUni           19.3          17.1          -11.40%     
BenchmarkDdotSmallUniInc           18.7          17.2          -8.02%      
BenchmarkDdotSmallBothInc          19.1          16.9          -11.52%     
BenchmarkDdotMediumBothUnitary     850           417           -50.94%     
BenchmarkDdotMediumIncUni          1128          987           -12.50%     
BenchmarkDdotMediumUniInc          1124          997           -11.30%     
BenchmarkDdotMediumBothInc         1171          1020          -12.89%     
BenchmarkDdotLargeBothUnitary      84471         45765         -45.82%     
BenchmarkDdotLargeIncUni           171155        143554        -16.13%     
BenchmarkDdotLargeUniInc           119249        103142        -13.51%     
BenchmarkDdotLargeBothInc          249504        242060        -2.98%      
BenchmarkDdotHugeBothUnitary       10760149      9142349       -15.04%     
BenchmarkDdotHugeIncUni            27278566      26659861      -2.27%      
BenchmarkDdotHugeUniInc            19205312      17377502      -9.52%      
BenchmarkDdotHugeBothInc           35331774      35384487      +0.15%

new.txt vs cblas using the OS X Accelerate framework.

benchmark                          old ns/op     new ns/op     delta         
BenchmarkDdotSmallBothUnitary      12.4          196           +1480.65%     
BenchmarkDdotSmallIncUni           17.1          201           +1075.44%     
BenchmarkDdotSmallUniInc           17.2          200           +1062.79%     
BenchmarkDdotSmallBothInc          16.9          200           +1083.43%     
BenchmarkDdotMediumBothUnitary     417           340           -18.47%       
BenchmarkDdotMediumIncUni          987           866           -12.26%       
BenchmarkDdotMediumUniInc          997           774           -22.37%       
BenchmarkDdotMediumBothInc         1020          1013          -0.69%        
BenchmarkDdotLargeBothUnitary      45765         49874         +8.98%        
BenchmarkDdotLargeIncUni           143554        147430        +2.70%        
BenchmarkDdotLargeUniInc           103142        87223         -15.43%       
BenchmarkDdotLargeBothInc          242060        232011        -4.15%        
BenchmarkDdotHugeBothUnitary       9142349       9228276       +0.94%        
BenchmarkDdotHugeIncUni            26659861      28009570      +5.06%        
BenchmarkDdotHugeUniInc            17377502      17337081      -0.23%        
BenchmarkDdotHugeBothInc           35384487      34871204      -1.45%  

Accelerate is still doing something better (especially considering the cgo overhead), but this eliminates much of the gap.

jonlawlor commented 9 years ago

Atom 510, x86_64:

benchmark                          old ns/op     new ns/op     delta
BenchmarkDdotSmallBothUnitary      139           113           -18.71%
BenchmarkDdotSmallIncUni           180           148           -17.78%
BenchmarkDdotSmallUniInc           201           159           -20.90%
BenchmarkDdotSmallBothInc          189           158           -16.40%
BenchmarkDdotMediumBothUnitary     8534          4918          -42.37%
BenchmarkDdotMediumIncUni          15988         9994          -37.49%
BenchmarkDdotMediumUniInc          16055         9446          -41.16%
BenchmarkDdotMediumBothInc         18266         12412         -32.05%
BenchmarkDdotLargeBothUnitary      1020829       646607        -36.66%
BenchmarkDdotLargeIncUni           1793379       1307234       -27.11%
BenchmarkDdotLargeUniInc           1746168       1135265       -34.99%
BenchmarkDdotLargeBothInc          2114698       1673166       -20.88%
BenchmarkDdotHugeBothUnitary       101451737     64170337      -36.75%
BenchmarkDdotHugeIncUni            178849842     129829209     -27.41%
BenchmarkDdotHugeUniInc            173484821     111482658     -35.74%
BenchmarkDdotHugeBothInc           209515105     168920310     -19.38%
fhs commented 9 years ago

1) With regards to compatibility, it doesn't depend much on the runtime except for the layout of slices, the 16-byte alignment of slices, and the calling convention. I don't see these things changing any time soon. In the worst case, we'll need different implementations for different Go versions.

I've fixed the build failure in 1.3.

2) It would be nice to have 386 and arm builders so that the alternate implementations get tested.

4) I don't know much about these CPU features, but I think you can just detect the CPU feature from CPUID and jump to the fastest code. We may need to ask the Go authors to add new instructions if they are not available.

jonlawlor commented 9 years ago

re: 2), we currently use travis-ci for hosted builds, and they don't allow you to change builders (as far as I know). It might be possible on Shippable, but I have to do more digging.

kortschak commented 9 years ago

The problem is that the travis-ci release uses Go 1.3.3, while the tip uses Go 1.4.

If that's the case it's at odds with the normal meaning of those words - release is currently go1.4 and tip is golang/go master. Can we be explicit rather than using incorrect terms?

kortschak commented 9 years ago

This is a question born out of ignorance, but why do you convert to uintptr rather than leaving as an int? The size of int is known since the asm is only used on amd64. Is it just defensive?

fhs commented 9 years ago

This is a question born out of ignorance, but why do you convert to uintptr rather than leaving as an int? The size of int is known since the asm is only used on amd64. Is it just defensive?

Yes, just being defensive. int used to be 4 bytes long on amd64, if I remember correctly.

kortschak commented 9 years ago

Yes, it was a long time ago.

kortschak commented 9 years ago

Do we want to support GOARCH=amd64p32?

fhs commented 9 years ago

Do we want to support GOARCH=amd64p32?

amd64p32 is for nacl, and it uses 32-bit pointers. I'm afraid it requires more work than simply reusing the .s file for amd64.

jonlawlor commented 9 years ago

If that's the case it's at odds with the normal meaning of those words - release is currently go1.4 and tip is golang/go master. Can we be explicit rather than using incorrect terms?

You're right, I've raised an issue with travis: https://github.com/travis-ci/travis-ci/issues/3076

Also, do you want me to add explicit versions to the travis build matrix? We wouldn't have noticed that this would fail on 1.3.3 if it wasn't still the release version there.

kortschak commented 9 years ago

@fhs, no worries. We can add explicit asm support later if people ask (presumably it involves offset size changes that might be mechanisable).

LGTM, but wait for @btracey. Would you send a PR to add yourself to the gonum/license AUTHORS/CONTRIBUTORS files, please?

@jonlawlor, I would prefer to see an explicit list of versions - go1.3 would be in that list since it's the lowest version we support. The .travis.yaml should essentially be a statement of supported versions.

btracey commented 9 years ago

@jonlawlor what failure are you getting? Works fine for me on 1.3.3 on OSX

btracey commented 9 years ago

Sorry, I missed the fix.

LGTM.

jonlawlor commented 9 years ago

@kortschak, ok, I'll change them to include 1.3.3, release, and tip. It won't change anything currently building.

kortschak commented 9 years ago

Thank you for doing this.

fhs commented 9 years ago

Thanks for all the feedback.