Closed: btracey closed this 10 years ago.
Will we see some benchmarks to compare the improvements?
Yeah, I can put some together. On my machine with informal benchmarking, it was approximately the same speed for small cases, and a roughly linear speedup (with NumCPU == 4) for large cases. There are some weird things: for example, if you're multiplying a 2×N times N×2, the current implementation is serial, and I couldn't think of a clean way to implement something smarter that didn't create O(mn) garbage. Something to fix in the future, I suppose.
Also, anecdotally, my BFGS optimization case is faster than it was before (almost all of the time was spent in matrix operations for my larger cases).
PTAL
Pushed up some benchmarks. The new serial implementation is the same as the old one (and no concurrency is used if GOMAXPROCS == 1), so it's a fair comparison. I see between a 2× and 3× speedup on my 4-core laptop.
Data:
BenchmarkDgemmSmSmSm            200000         9336 ns/op
BenchmarkDgemmSmSmSm-2          200000         9400 ns/op
BenchmarkDgemmSmSmSm-4          200000         9383 ns/op
BenchmarkDgemmMedMedMed            200      8257148 ns/op
BenchmarkDgemmMedMedMed-2          500      4659084 ns/op
BenchmarkDgemmMedMedMed-4          500      3649445 ns/op
BenchmarkDgemmLgLgLg                 1   8379012845 ns/op
BenchmarkDgemmLgLgLg-2               1   4273214376 ns/op
BenchmarkDgemmLgLgLg-4               1   3118850873 ns/op
BenchmarkDgemmLgSmLg                20     99298649 ns/op
BenchmarkDgemmLgSmLg-2              50     50398952 ns/op
BenchmarkDgemmLgSmLg-4              50     35193893 ns/op
BenchmarkDgemmLgLgSm                20     97152463 ns/op
BenchmarkDgemmLgLgSm-2              50     49572288 ns/op
BenchmarkDgemmLgLgSm-4              50     37355066 ns/op
BenchmarkDgemmHgHgSm                 1  10374010310 ns/op
BenchmarkDgemmHgHgSm-2               1   5129627645 ns/op
BenchmarkDgemmHgHgSm-4               1   3450953345 ns/op
BenchmarkDgemmMedMedMedTNT         200      8433412 ns/op
BenchmarkDgemmMedMedMedTNT-2       500      5147551 ns/op
BenchmarkDgemmMedMedMedTNT-4       500      3847916 ns/op
BenchmarkDgemmMedMedMedNTT         100     12741118 ns/op
BenchmarkDgemmMedMedMedNTT-2       500      6583197 ns/op
BenchmarkDgemmMedMedMedNTT-4       500      5331153 ns/op
BenchmarkDgemmMedMedMedNTNT        200      8393252 ns/op
BenchmarkDgemmMedMedMedNTNT-2      500      4818718 ns/op
BenchmarkDgemmMedMedMedNTNT-4      500      3998515 ns/op
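(For reference, the -2 and -4 suffixes are the GOMAXPROCS values passed via go test's -cpu flag; an invocation roughly like the following, run in the package directory, which is an assumption here, produces the three rows per benchmark:

go test -bench=Dgemm -cpu=1,2,4
)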
Wow. There is an enormous speedup by cleaning up the center loop a bit. I can either make changes of this style in this PR while we're at it, or I can submit them in another request. The code is uglier, but an 80% speedup is too good to pass up.
Change: In the NoTransNoTrans loop, replace the middle with
// For each row of A, accumulate scaled rows of B into the corresponding
// row of C: C[i][:] += alpha * A[i][l] * B[l][:].
for i := 0; i < a.rows; i++ {
	atmp := a.data[i*a.stride : i*a.stride+a.cols]
	for l, v := range atmp {
		tmp := alpha * v
		if tmp != 0 {
			btmp := b.data[l*b.stride : l*b.stride+b.cols]
			ctmp := c.data[i*c.stride:]
			for j, w := range btmp {
				ctmp[j] += tmp * w
			}
		}
	}
}
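For comparison, the naive version this replaces presumably computes one element of C at a time, roughly like this (a reconstruction from context, not the actual original code):

// Naive triple loop: one dot product per element of C.
for i := 0; i < a.rows; i++ {
	for j := 0; j < b.cols; j++ {
		var sum float64
		for l := 0; l < a.cols; l++ {
			sum += a.data[i*a.stride+l] * b.data[l*b.stride+j]
		}
		c.data[i*c.stride+j] += alpha * sum
	}
}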
benchmark old ns/op new ns/op delta
BenchmarkDgemmSmSmSm 8848 2350 -73.44%
BenchmarkDgemmSmSmSm-2 9168 2412 -73.69%
BenchmarkDgemmSmSmSm-4 9503 2396 -74.79%
BenchmarkDgemmMedMedMed 7925222 1524355 -80.77%
BenchmarkDgemmMedMedMed-2 4450811 906645 -79.63%
BenchmarkDgemmMedMedMed-4 2956430 545959 -81.53%
BenchmarkDgemmLgLgLg 8321178293 1553491458 -81.33%
BenchmarkDgemmLgLgLg-2 4096481733 781449199 -80.92%
BenchmarkDgemmLgLgLg-4 2774749540 513152320 -81.51%
BenchmarkDgemmLgSmLg 91990591 19434236 -78.87%
BenchmarkDgemmLgSmLg-2 46119178 9843512 -78.66%
BenchmarkDgemmLgSmLg-4 27719225 6350394 -77.09%
BenchmarkDgemmLgLgSm 92230142 19371202 -79.00%
BenchmarkDgemmLgLgSm-2 46044000 10108361 -78.05%
BenchmarkDgemmLgLgSm-4 29583221 6224908 -78.96%
BenchmarkDgemmHgHgSm 9507039587 2208681247 -76.77%
BenchmarkDgemmHgHgSm-2 4761512112 1100965519 -76.88%
BenchmarkDgemmHgHgSm-4 3173900203 704244021 -77.81%
BenchmarkDgemmMedMedMedTNT 8100408 8074825 -0.32%
BenchmarkDgemmMedMedMedTNT-2 4744852 4753853 +0.19%
BenchmarkDgemmMedMedMedTNT-4 3537085 3536305 -0.02%
BenchmarkDgemmMedMedMedNTT 11527785 11404634 -1.07%
BenchmarkDgemmMedMedMedNTT-2 5956448 6231677 +4.62%
BenchmarkDgemmMedMedMedNTT-4 4322916 4395229 +1.67%
BenchmarkDgemmMedMedMedNTNT 7998016 1545490 -80.68%
BenchmarkDgemmMedMedMedNTNT-2 4583640 930118 -79.71%
BenchmarkDgemmMedMedMedNTNT-4 3435744 619345 -81.97%
Of course, there's still quite a bit to go before matching C. This compares the fast Go implementation above (old) against the OS X Accelerate framework (new).
benchmark old ns/op new ns/op delta
BenchmarkDgemmSmSmSm 2350 411 -82.51%
BenchmarkDgemmSmSmSm-2 2412 410 -83.00%
BenchmarkDgemmSmSmSm-4 2396 419 -82.51%
BenchmarkDgemmMedMedMed 1524355 92046 -93.96%
BenchmarkDgemmMedMedMed-2 906645 90606 -90.01%
BenchmarkDgemmMedMedMed-4 545959 91317 -83.27%
BenchmarkDgemmLgLgLg 1553491458 34870539 -97.76%
BenchmarkDgemmLgLgLg-2 781449199 33852119 -95.67%
BenchmarkDgemmLgLgLg-4 513152320 34142406 -93.35%
BenchmarkDgemmLgSmLg 19434236 665667 -96.57%
BenchmarkDgemmLgSmLg-2 9843512 675901 -93.13%
BenchmarkDgemmLgSmLg-4 6350394 634556 -90.01%
BenchmarkDgemmLgLgSm 19371202 650770 -96.64%
BenchmarkDgemmLgLgSm-2 10108361 644214 -93.63%
BenchmarkDgemmLgLgSm-4 6224908 651011 -89.54%
BenchmarkDgemmHgHgSm 2208681247 168102317 -92.39%
BenchmarkDgemmHgHgSm-2 1100965519 185098678 -83.19%
BenchmarkDgemmHgHgSm-4 704244021 168688949 -76.05%
BenchmarkDgemmMedMedMedTNT 8074825 96711 -98.80%
BenchmarkDgemmMedMedMedTNT-2 4753853 89994 -98.11%
BenchmarkDgemmMedMedMedTNT-4 3536305 89848 -97.46%
BenchmarkDgemmMedMedMedNTT 11404634 90931 -99.20%
BenchmarkDgemmMedMedMedNTT-2 6231677 91002 -98.54%
BenchmarkDgemmMedMedMedNTT-4 4395229 91713 -97.91%
BenchmarkDgemmMedMedMedNTNT 1545490 90069 -94.17%
BenchmarkDgemmMedMedMedNTNT-2 930118 90709 -90.25%
BenchmarkDgemmMedMedMedNTNT-4 619345 90158 -85.44%
Clarification: in the benchmarks above, the cases involving transposes (the ones with NTT, TT, or TNT in the name) did not use the fast kernel.
Last point of note: disabling bounds checking makes the code about 30% faster. This puts Go between 75 and 90 percent of C (in serial). If the code were vectorized it would be about 75% faster, so we'd still be behind a bit.
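(Those bounds-check-free numbers can presumably be reproduced by compiling with the gc compiler's -B flag, along these lines; the exact invocation is an assumption:

go test -gcflags=-B -bench=Dgemm -cpu=1
)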
That code doesn't look particularly ugly, but you can elide atmp and btmp into the range statements.
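That is, something along these lines (just a sketch of the suggestion):

for i := 0; i < a.rows; i++ {
	for l, v := range a.data[i*a.stride : i*a.stride+a.cols] {
		tmp := alpha * v
		if tmp != 0 {
			ctmp := c.data[i*c.stride:]
			for j, w := range b.data[l*b.stride : l*b.stride+b.cols] {
				ctmp[j] += tmp * w
			}
		}
	}
}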
SGTM
Do you think the code is legible enough to just have that, or should I leave the original in a block comment? I think the original code looks more "matrix-like" and so it might be nice to have as a reference.
I think it's fine, but you could add a comment above - not code though, just a description of approach and why it was chosen over the naive approach that is there now.
PTAL. Pushed fast inner loops, and made a small change to the concurrency as it appeared faster with benchmarks.
One more piece of data. The OS X Accelerate framework is apparently threaded by default. Here is the serial comparison when VECLIB_MAXIMUM_THREADS is set to 1 (a sample invocation follows the table).
benchmark old ns/op new ns/op delta
BenchmarkDgemmSmSmSm 1754 427 -75.66%
BenchmarkDgemmMedMedMed 1017541 94900 -90.67%
BenchmarkDgemmLgLgLg 993933225 81506558 -91.80%
BenchmarkDgemmLgSmLg 11883287 1796243 -84.88%
BenchmarkDgemmLgLgSm 12002537 1767330 -85.28%
BenchmarkDgemmHgHgSm 1174902040 276808425 -76.44%
BenchmarkDgemmMedMedMedTNT 1197057 94367 -92.12%
BenchmarkDgemmMedMedMedNTT 1006570 95186 -90.54%
BenchmarkDgemmMedMedMedNTNT 1020169 94615 -90.73%
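(The single-threaded Accelerate numbers above would come from a run along these lines; the exact invocation is an assumption:

VECLIB_MAXIMUM_THREADS=1 go test -bench=Dgemm -cpu=1
)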
Ping. Is this okay to push?
LGTM
Thanks for the helpful review.
Matrix multiply is a core algorithm for many advanced matrix operations. Concurrency is a core feature of Go. This PR implements a parallel matrix multiply which concurrently multiplies sub-blocks of matrices. If the matrices are sufficiently small, or GOMAXPROCS == 1, the matrices are multiplied in serial to avoid the extra overhead.
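A minimal sketch of that idea, using plain [][]float64 matrices instead of the package's stride-based layout (the block size, function names, and row-block partitioning here are illustrative assumptions, not the PR's code): each goroutine multiplies a disjoint block of rows of C, and small problems or GOMAXPROCS == 1 fall back to the serial loop.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// blockRows is a hypothetical crossover point; the real threshold would be
// tuned with benchmarks.
const blockRows = 64

// serialMul computes c += a*b with a plain loop nest.
func serialMul(c, a, b [][]float64) {
	for i := range a {
		for l, v := range a[i] {
			if v != 0 {
				for j, w := range b[l] {
					c[i][j] += v * w
				}
			}
		}
	}
}

// parMul splits the rows of a (and the matching rows of c) into blocks and
// multiplies each block in its own goroutine. The row blocks of c are
// disjoint, so the goroutines need no locking on c.
func parMul(c, a, b [][]float64) {
	if runtime.GOMAXPROCS(0) == 1 || len(a) <= blockRows {
		serialMul(c, a, b)
		return
	}
	var wg sync.WaitGroup
	for i := 0; i < len(a); i += blockRows {
		end := i + blockRows
		if end > len(a) {
			end = len(a)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			serialMul(c[lo:hi], a[lo:hi], b)
		}(i, end)
	}
	wg.Wait()
}

func main() {
	a := [][]float64{{1, 2}, {3, 4}}
	b := [][]float64{{5, 6}, {7, 8}}
	c := [][]float64{{0, 0}, {0, 0}}
	parMul(c, a, b)
	fmt.Println(c) // [[19 22] [43 50]]
}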
This also fixes a bug in the old Dgemm implementation where the whole of the C matrix's backing data was scaled, including elements that are not part of the matrix (when the stride is larger than the number of columns).
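A sketch of what scaling only the visible part of C might look like in the stride-based layout of the snippets above (illustrative, not the PR's exact code); only c.cols elements of each row are touched, so padding between rows is left alone:

for i := 0; i < c.rows; i++ {
	ctmp := c.data[i*c.stride : i*c.stride+c.cols]
	for j := range ctmp {
		ctmp[j] *= beta
	}
}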
This PR also adds a number of tests verifying the implementation is correct.