I get this with Go 1.4:
$ benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
BenchmarkDdotSmallBothUnitary 23.1 21.6 -6.49%
BenchmarkDdotSmallIncUni 36.3 29.5 -18.73%
BenchmarkDdotSmallUniInc 44.9 28.0 -37.64%
BenchmarkDdotSmallBothInc 37.8 27.6 -26.98%
BenchmarkDdotMediumBothUnitary 1627 627 -61.46%
BenchmarkDdotMediumIncUni 2344 1647 -29.74%
BenchmarkDdotMediumUniInc 2478 1618 -34.71%
BenchmarkDdotMediumBothInc 2504 1945 -22.32%
BenchmarkDdotLargeBothUnitary 186890 84719 -54.67%
BenchmarkDdotLargeIncUni 713870 634351 -11.14%
BenchmarkDdotLargeUniInc 433818 317105 -26.90%
BenchmarkDdotLargeBothInc 935882 841664 -10.07%
BenchmarkDdotHugeBothUnitary 27035479 22361410 -17.29%
BenchmarkDdotHugeIncUni 72098861 65191729 -9.58%
BenchmarkDdotHugeUniInc 48003429 50984946 +6.21%
BenchmarkDdotHugeBothInc 86445132 91902762 +6.31%
I can't see why the release test didn't build on travis - textflag.h should be present.
Would you add license headers?
... and with go version devel +a877e81.
$ benchcmp old.txt new.txt
benchmark old ns/op new ns/op delta
BenchmarkDdotSmallBothUnitary 22.2 20.6 -7.21%
BenchmarkDdotSmallIncUni 34.9 27.4 -21.49%
BenchmarkDdotSmallUniInc 35.3 27.5 -22.10%
BenchmarkDdotSmallBothInc 35.4 27.3 -22.88%
BenchmarkDdotMediumBothUnitary 1600 620 -61.25%
BenchmarkDdotMediumIncUni 2320 1630 -29.74%
BenchmarkDdotMediumUniInc 2400 1603 -33.21%
BenchmarkDdotMediumBothInc 2296 1946 -15.24%
BenchmarkDdotLargeBothUnitary 162103 81675 -49.62%
BenchmarkDdotLargeIncUni 626682 583883 -6.83%
BenchmarkDdotLargeUniInc 335803 298583 -11.08%
BenchmarkDdotLargeBothInc 848481 825154 -2.75%
BenchmarkDdotHugeBothUnitary 23342114 21972418 -5.87%
BenchmarkDdotHugeIncUni 66768363 63730296 -4.55%
BenchmarkDdotHugeUniInc 45670037 44210559 -3.20%
BenchmarkDdotHugeBothInc 85100256 83724079 -1.62%
The problem is that the travis-ci release uses Go 1.3.3, while the tip uses Go 1.4. If you look at the build details, you'll see that tip passed, but release failed. I switched my computer to 1.3.3, tried it out, and got the same error as the travis build.
edit: in the past we've tried to keep support for older versions of Go, and 1.3.3 is very recent. Is there a way to get this working in that version?
Thanks for putting this together. A couple of high-level comments: 1) I'm not sure what we want to do about code review/maintenance. I can sort of tell what's going on, but that's largely because of the nice comments in the routine. I like fast code as much as anyone, but we need to have people who can maintain the code (for example, to support changes in the runtime).
2) We'll need to expand the test suite, specifically to include cases with strides wider than the matrix and cases where the slices shadow one another by various amounts (a sketch of the kind of test I mean follows this comment). With assembly we've coded ourselves, we can't rely on the Go team having done the right thing.
3) Assuming it's maintainable and correct, this could be quite promising. There are a small number of loop kernels (ddot, axpy, and scale at least, but probably not too many more), and it's probably not that hard to recode the inner loops of the higher-level functions to use them. I don't know how much the lack of inline assembly will hurt, but the Go tool may improve at some point.
4) Is it possible to code these in such a way that they are extensible to more than just MULPD? The Intel Knights Landing chips, for example, are moving in the direction of larger and larger vector units. It would be great if we had code that could take advantage of that somehow. What I have in mind is something like:
type SimdSize struct {
	length         int    // vector width, in float64 elements
	mulInstruction string // packed multiply instruction, e.g. "MULPD"
	// etc.
}
We would then have a code generator that takes in an array of SimdSizes and generates the assembly. To generate the amd64 code, you would give it only SimdSize{2, "MULPD"}, while for Knights Landing you would give it the instructions for the 128-, 256-, and 512-bit vector sizes. The generator tool would provide the logic for calling those in succession (as well as testing for shadowing, etc.). It seems that the future is headed toward manycore, and the Go language has the constructs to take advantage of that. It would be nice if we could also take advantage of the additional vectorization. It may be that the assemblies are so different that such an endeavor is too much effort. Just a thought.
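To make point 4 concrete, here is a minimal sketch of the sort of generator I mean, driven by text/template. Everything here is illustrative rather than a worked-out design: the SimdSize fields are exported only so the template can read them, the register choices are arbitrary, and the emitted loop is not meant to assemble as-is.

package main

import (
	"os"
	"text/template"
)

// SimdSize describes one vector width. Fields are exported so text/template
// can see them; the names are illustrative.
type SimdSize struct {
	Lanes   int    // float64 elements processed per iteration
	MulInst string // packed multiply instruction, e.g. "MULPD"
	AddInst string // packed add instruction, e.g. "ADDPD"
}

// loopTmpl emits one inner-loop body per vector width. The output shown here
// is illustrative text, not verified assembly.
var loopTmpl = template.Must(template.New("loop").Parse(`
// {{.Lanes}}-wide inner loop (illustrative output)
loop{{.Lanes}}:
	MOVUPS (SI), X2
	MOVUPS (DI), X3
	{{.MulInst}} X2, X3
	{{.AddInst}} X3, X0
	ADDQ ${{.Lanes}}*8, SI
	ADDQ ${{.Lanes}}*8, DI
	SUBQ ${{.Lanes}}, CX
	JGE  loop{{.Lanes}}
`))

func main() {
	// Plain amd64/SSE2 would get only the 2-lane entry; a wider target would
	// add 4- and 8-lane entries, and the generator would chain the loops.
	sizes := []SimdSize{{Lanes: 2, MulInst: "MULPD", AddInst: "ADDPD"}}
	for _, s := range sizes {
		if err := loopTmpl.Execute(os.Stdout, s); err != nil {
			panic(err)
		}
	}
}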
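And going back to point 2, a rough sketch of the kind of case I mean, with strided vectors that share a backing array. ddot here is a plain Go stand-in; in the real suite the assembly kernel would be called and compared against a reference loop like the one inside the test.

package blas_test

import (
	"math"
	"testing"
)

// ddot is a stand-in for the implementation under test (the assembly kernel
// in the real suite).
func ddot(n int, x []float64, incX int, y []float64, incY int) float64 {
	var sum float64
	for i := 0; i < n; i++ {
		sum += x[i*incX] * y[i*incY]
	}
	return sum
}

// TestDdotShadowed exercises strided inputs whose slices overlap ("shadow")
// the same backing array, which hand-written assembly must handle correctly.
func TestDdotShadowed(t *testing.T) {
	backing := make([]float64, 32)
	for i := range backing {
		backing[i] = float64(i)
	}
	const n, incX, incY = 8, 3, 2
	x := backing[:]  // x and y overlap:
	y := backing[2:] // y starts two elements into x's storage.

	// Reference result from a plain loop.
	var want float64
	for i := 0; i < n; i++ {
		want += x[i*incX] * y[i*incY]
	}
	if got := ddot(n, x, incX, y, incY); math.Abs(got-want) > 1e-14 {
		t.Errorf("shadowed strided ddot: got %v, want %v", got, want)
	}
}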
Here are my benchmarks:
On Go 1.4, old vs new:
benchmark old ns/op new ns/op delta
BenchmarkDdotSmallBothUnitary 13.2 12.4 -6.06%
BenchmarkDdotSmallIncUni 19.3 17.1 -11.40%
BenchmarkDdotSmallUniInc 18.7 17.2 -8.02%
BenchmarkDdotSmallBothInc 19.1 16.9 -11.52%
BenchmarkDdotMediumBothUnitary 850 417 -50.94%
BenchmarkDdotMediumIncUni 1128 987 -12.50%
BenchmarkDdotMediumUniInc 1124 997 -11.30%
BenchmarkDdotMediumBothInc 1171 1020 -12.89%
BenchmarkDdotLargeBothUnitary 84471 45765 -45.82%
BenchmarkDdotLargeIncUni 171155 143554 -16.13%
BenchmarkDdotLargeUniInc 119249 103142 -13.51%
BenchmarkDdotLargeBothInc 249504 242060 -2.98%
BenchmarkDdotHugeBothUnitary 10760149 9142349 -15.04%
BenchmarkDdotHugeIncUni 27278566 26659861 -2.27%
BenchmarkDdotHugeUniInc 19205312 17377502 -9.52%
BenchmarkDdotHugeBothInc 35331774 35384487 +0.15%
new.txt vs cblas using the OSX Accelerate framework:
benchmark old ns/op new ns/op delta
BenchmarkDdotSmallBothUnitary 12.4 196 +1480.65%
BenchmarkDdotSmallIncUni 17.1 201 +1075.44%
BenchmarkDdotSmallUniInc 17.2 200 +1062.79%
BenchmarkDdotSmallBothInc 16.9 200 +1083.43%
BenchmarkDdotMediumBothUnitary 417 340 -18.47%
BenchmarkDdotMediumIncUni 987 866 -12.26%
BenchmarkDdotMediumUniInc 997 774 -22.37%
BenchmarkDdotMediumBothInc 1020 1013 -0.69%
BenchmarkDdotLargeBothUnitary 45765 49874 +8.98%
BenchmarkDdotLargeIncUni 143554 147430 +2.70%
BenchmarkDdotLargeUniInc 103142 87223 -15.43%
BenchmarkDdotLargeBothInc 242060 232011 -4.15%
BenchmarkDdotHugeBothUnitary 9142349 9228276 +0.94%
BenchmarkDdotHugeIncUni 26659861 28009570 +5.06%
BenchmarkDdotHugeUniInc 17377502 17337081 -0.23%
BenchmarkDdotHugeBothInc 35384487 34871204 -1.45%
Accelerate is still doing something better (especially considering the cgo overhead), but this eliminates much of the gap.
Atom 510, x86_64:
benchmark old ns/op new ns/op delta
BenchmarkDdotSmallBothUnitary 139 113 -18.71%
BenchmarkDdotSmallIncUni 180 148 -17.78%
BenchmarkDdotSmallUniInc 201 159 -20.90%
BenchmarkDdotSmallBothInc 189 158 -16.40%
BenchmarkDdotMediumBothUnitary 8534 4918 -42.37%
BenchmarkDdotMediumIncUni 15988 9994 -37.49%
BenchmarkDdotMediumUniInc 16055 9446 -41.16%
BenchmarkDdotMediumBothInc 18266 12412 -32.05%
BenchmarkDdotLargeBothUnitary 1020829 646607 -36.66%
BenchmarkDdotLargeIncUni 1793379 1307234 -27.11%
BenchmarkDdotLargeUniInc 1746168 1135265 -34.99%
BenchmarkDdotLargeBothInc 2114698 1673166 -20.88%
BenchmarkDdotHugeBothUnitary 101451737 64170337 -36.75%
BenchmarkDdotHugeIncUni 178849842 129829209 -27.41%
BenchmarkDdotHugeUniInc 173484821 111482658 -35.74%
BenchmarkDdotHugeBothInc 209515105 168920310 -19.38%
1) With regard to compatibility, it doesn't depend much on the runtime except for the layout of slices, the 16-byte alignment of slices, and the calling convention. I don't see these things changing any time soon. In the worst case, we'll need different implementations for different Go versions.
I've fixed the build failure in 1.3.
2) It would be nice to have 386 and arm builders so that the alternate implementations get tested.
4) I don't know much about these CPU features, but I think you can just detect them with CPUID and jump to the fastest code. We may need to ask the Go authors to add new instructions to the assembler if they are not available.
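A rough sketch of what that dispatch could look like on the Go side; hasAVX, ddotAVX, and DdotUnitary are made-up names, and the feature probe itself would be a small assembly routine issuing CPUID.

package asm

// hasAVX would be implemented in assembly, issuing CPUID (and XGETBV for OS
// support) to report whether AVX is usable. Hypothetical.
func hasAVX() bool

// ddotGeneric is the portable Go kernel, used as the fallback.
func ddotGeneric(x, y []float64) float64 {
	var sum float64
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

// ddotAVX would be the wider assembly kernel. Hypothetical.
func ddotAVX(x, y []float64) float64

// DdotUnitary is bound once, at startup, to the fastest kernel the CPU supports.
var DdotUnitary = ddotGeneric

func init() {
	if hasAVX() {
		DdotUnitary = ddotAVX
	}
}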
re: 2), we currently use travis-ci for hosted builds, and they don't allow you to change builders (as far as I know). It might be possible on shippable, but I have to do more digging.
The problem is that the travis-ci release uses Go 1.3.3, while the tip uses Go 1.4.
If that's the case it's at odds with the normal meaning of those words - release is currently go1.4 and tip is golang/go master. Can we be explicit rather than using incorrect terms?
This is a question born out of ignorance, but why do you convert to uintptr rather than leaving it as an int? The size of int is known, since the asm is only used on amd64. Is it just defensive?
Yes, just being defensive. int used to be 4 bytes long on amd64, if I remember correctly.
Yes, it was a long time ago.
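For reference, the defensive pattern being discussed looks roughly like this; the names and signatures are illustrative, not the actual functions in this PR.

package asm

// Ddot is the Go wrapper. Converting the int arguments to uintptr before the
// assembly call pins them to the machine word size, so the assembly does not
// need to care how wide int is on a particular GOARCH.
func Ddot(n int, x []float64, incX int, y []float64, incY int) float64 {
	return ddotInc(x, y, uintptr(n), uintptr(incX), uintptr(incY))
}

// ddotInc would be implemented in assembly (ddot_amd64.s); it reads the slice
// headers and the uintptr arguments at fixed offsets from FP.
func ddotInc(x, y []float64, n, incX, incY uintptr) float64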
Do we want to support GOARCH=amd64p32?
amd64p32 is for NaCl, and it uses 32-bit pointers. I'm afraid it requires more work than simply reusing the .s file for amd64.
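For what it's worth, keeping amd64p32 (and 386/arm) on the pure Go path is mostly a build-constraint exercise: the assembly lives in _amd64-suffixed files, and everything else gets a fallback along these lines (file name, package, and function name are illustrative).

// ddot_noasm.go

// +build !amd64

package asm

// DdotUnitary is the pure Go fallback, built wherever the assembly kernel is
// not (386, arm, amd64p32, ...).
func DdotUnitary(x, y []float64) (sum float64) {
	for i, v := range x {
		sum += v * y[i]
	}
	return
}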
If that's the case it's at odds with the normal meaning of those words - release is currently go1.4 and tip is golang/go master. Can we be explicit rather than using incorrect terms?
You're right, I've raised an issue with travis: https://github.com/travis-ci/travis-ci/issues/3076
Also, do you want me to add explicit versions to the travis build matrix? We wouldn't have noticed that this would fail on 1.3.3 if it wasn't still the release version there.
@fhs, no worries. We can add explicit asm support later if people ask (presumably it involves offset size changes that might be mechanisable).
LGTM, but wait for @btracey. Would you send a PR to add yourself to gonum/license AUTHORS/CONTRIBUTORS, please?
@jonlawlor, I would prefer to see an explicit list of versions - go1.3 would be in that list since it's the lowest version we support. The .travis.yaml should essentially be a statement of supported versions.
@jonlawlor what failure are you getting? Works fine for me on 1.3.3 on OSX
Sorry, I missed the fix.
LGTM.
@kortschak, ok, I'll change them to include 1.3.3, release, and tip. It won't change anything currently building.
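Presumably that ends up looking something like this in the travis config ("release" and "tip" are travis's own aliases, as discussed above):

language: go

go:
  - 1.3.3
  - release
  - tip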
Thank you for doing this.
Thanks for all the feedback.
Benchmark using Go 1.4 release on Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz:
On my laptop with Go 1.5-dev and Intel(R) Core(TM) i5-4330M CPU @ 2.80GHz I got these numbers (they are actually not very stable because of CPU scaling):
I would be interested in what numbers other people get! Run: