gonum / blas

A BLAS implementation for Go [DEPRECATED]
172 stars 16 forks source link

Simplify dgemm to use fewer goroutines and fewer messages. #22

Closed btracey closed 10 years ago

btracey commented 10 years ago

The old code was wasteful. It sent 2 messages every block computation, and it launched a new goroutine at each block computation. This was done to synchronize the computations along the k dimension, but this is silly, because the k dimension is being computed in serial anyway. This refactors the code such that the k dimension is computed in serial by a worker before a new {i,j} pair is received. This reduced the messages sent, reduces the number of goroutines, and should help cache coherency. This seems to improve performance somewhat, but even if the performance gains were negligible, the simplification makes it worth it. When the inner loop is sped up (bounds checking, vectorization), this code will almost certainly be faster.

BenchmarkDgemmSmSmSm 2372 2313 -2.49% BenchmarkDgemmSmSmSm-2 2280 2267 -0.57% BenchmarkDgemmSmSmSm-4 2290 2295 +0.22% BenchmarkDgemmMedMedMed 1302412 1293271 -0.70% BenchmarkDgemmMedMedMed-2 723841 717892 -0.82% BenchmarkDgemmMedMedMed-4 563282 575358 +2.14% BenchmarkDgemmMedLgMed 12601547 12633168 +0.25% BenchmarkDgemmMedLgMed-2 6668578 6727320 +0.88% BenchmarkDgemmMedLgMed-4 4005838 4029760 +0.60% BenchmarkDgemmLgLgLg 1284515732 1317360458 +2.56% BenchmarkDgemmLgLgLg-2 663440326 653124639 -1.55% BenchmarkDgemmLgLgLg-4 442994398 443350897 +0.08% BenchmarkDgemmLgSmLg 15146665 15495804 +2.31% BenchmarkDgemmLgSmLg-2 7894837 7818921 -0.96% BenchmarkDgemmLgSmLg-4 4770521 5022796 +5.29% BenchmarkDgemmLgLgSm 15419348 14768983 -4.22% BenchmarkDgemmLgLgSm-2 7854456 7819836 -0.44% BenchmarkDgemmLgLgSm-4 4833188 4891688 +1.21% BenchmarkDgemmHgHgSm 1542986120 1498077839 -2.91% BenchmarkDgemmHgHgSm-2 770878951 753059418 -2.31% BenchmarkDgemmHgHgSm-4 524997747 503342998 -4.12% BenchmarkDgemmMedMedMedTNT 1293529 1296046 +0.19% BenchmarkDgemmMedMedMedTNT-2 744613 730826 -1.85% BenchmarkDgemmMedMedMedTNT-4 568071 569269 +0.21% BenchmarkDgemmMedMedMedNTT 1047344 1012308 -3.35% BenchmarkDgemmMedMedMedNTT-2 571599 561348 -1.79% BenchmarkDgemmMedMedMedNTT-4 451243 458217 +1.55% BenchmarkDgemmMedMedMedNTNT 1320557 1294146 -2.00% BenchmarkDgemmMedMedMedNTNT-2 721417 710031 -1.58% BenchmarkDgemmMedMedMedNTNT-4 558277 557583 -0.12%

btracey commented 10 years ago

PTAL @kortschak . The overall idea is the same as before, but the k computation is done in serial. This pushes the length checking into the worker rather than the sender loop. It seems to me easier to examine by looking at the file as a whole rather than the diff.

coveralls commented 10 years ago

Coverage Status

Coverage decreased (-0.4%) when pulling e71a9000bbb611a28bc233cb8623eb5ce7ce464c on simplifydgemm into ed65c878b02439c0178be850b8a3d14e1d331241 on master.

coveralls commented 10 years ago

Coverage Status

Coverage decreased (-0.41%) when pulling 7803e089ebb5bafe4b52d9f826c8c3abf90e80f6 on simplifydgemm into ed65c878b02439c0178be850b8a3d14e1d331241 on master.

jonlawlor commented 10 years ago

Those two approaches are functionally equivalent, so I suggest running benchmarks on them.

btracey commented 10 years ago

If you mean the broad architectural change, then the benchmarks are in the commit message. If you mean the cleanup, the difference between the two will be completely lost in the noise On Oct 30, 2014 5:55 AM, "Jonathan J Lawlor" notifications@github.com wrote:

Those two approaches are functionally equivalent, so I suggest running benchmarks on them.

— Reply to this email directly or view it on GitHub https://github.com/gonum/blas/pull/22#issuecomment-61086617.

btracey commented 10 years ago

PTAL

kortschak commented 10 years ago

LGTM after fixing minor comments.