Closed dbalchev closed 9 years ago
@dbalchev Thanks a lot for the work. We are still in the early phase of the library, so not filing an issue a priori is not a problem. I am merging this change. If there are any coding-convention or readability issues, I will address them later.
BTW, my earlier idea was to do a block-by-block transpose of the matrix first and then do a column-by-column multiplication. I had done some testing related to that; see for example the bench_multiply_transpose_simple benchmark. Is your new approach better?
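The transpose-then-multiply idea can be sketched roughly like this (a hypothetical, row-major `f64` version for square matrices; the actual bench_multiply_transpose_simple code may well differ):

```rust
// Sketch of transpose-then-multiply: transpose b first so that the inner
// dot product walks BOTH operands sequentially in memory.
// Matrices are stored row-major in flat slices; names are illustrative.
fn multiply_transpose_simple(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    // Transpose b: column j of b becomes row j of bt.
    let mut bt = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            bt[j * n + i] = b[i * n + j];
        }
    }
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            // Row i of a and row j of bt are both contiguous,
            // so this dot product is cache-friendly.
            c[i * n + j] = (0..n).map(|k| a[i * n + k] * bt[j * n + k]).sum();
        }
    }
    c
}
```

The extra O(n^2) transpose is amortized by the O(n^3) multiplication, which is why it pays off even though it does more total work.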
I've made a quick bench and copied it at the end of this post. It seems transposing and then multiplying is good for smallish matrices, but for big matrices (larger than, say, 128-256 elements) block multiplication is slightly faster.
I think it needs some more research and benchmarks (e.g. for rectangular matrices, for f32 and f64, and on other CPUs) to find the optimal thresholds between simple, transpose-then-multiply, and block_multiply. And maybe I should micro-optimize multiply_block a bit more (more pointers, fewer slices...).
test matrix::transpose::matrix_transpose::bench::bench_gram_optimized ... bench: 3,125,125 ns/iter (+/- 58,592)
test matrix::transpose::matrix_transpose::bench::bench_gram_simple ... bench: 6,206,784 ns/iter (+/- 416,182)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0001_block ... bench: 274 ns/iter (+/- 10)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0001_transpose_simple ... bench: 50 ns/iter (+/- 15)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0002_block ... bench: 301 ns/iter (+/- 182)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0002_transpose_simple ... bench: 62 ns/iter (+/- 6)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0004_block ... bench: 377 ns/iter (+/- 279)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0004_transpose_simple ... bench: 105 ns/iter (+/- 49)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0008_block ... bench: 777 ns/iter (+/- 38)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0008_transpose_simple ... bench: 393 ns/iter (+/- 270)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0016_block ... bench: 3,349 ns/iter (+/- 835)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0016_transpose_simple ... bench: 2,411 ns/iter (+/- 344)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0032_block ... bench: 25,055 ns/iter (+/- 2,710)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0032_transpose_simple ... bench: 20,424 ns/iter (+/- 10,489)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0064_block ... bench: 192,521 ns/iter (+/- 30,063)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0064_transpose_simple ... bench: 167,637 ns/iter (+/- 11,311)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0128_block ... bench: 1,537,360 ns/iter (+/- 101,314)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0128_transpose_simple ... bench: 1,601,791 ns/iter (+/- 125,868)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0256_block ... bench: 12,297,366 ns/iter (+/- 958,284)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0256_transpose_simple ... bench: 14,146,692 ns/iter (+/- 420,942)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0512_block ... bench: 98,059,444 ns/iter (+/- 7,939,212)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0512_transpose_simple ... bench: 131,954,064 ns/iter (+/- 14,199,473)
Okay, so having block multiplication is justified for now. Here are some references for further study:
https://en.wikipedia.org/wiki/Strassen_algorithm
https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm
Not sure if it's worth the effort to try implementing these. If so, we can test them on fairly large matrices, like 1024 x 1024 or 2048 x 2048.
This Stanford paper http://cs.stanford.edu/people/boyko/pubs/MatrixMult_SURJ_2004.pdf compares both the Strassen and Winograd algorithms and proposes a hybrid algorithm for better performance.
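For reference, the core of Strassen's algorithm is replacing the 8 block products of a 2x2 partition with 7. A minimal, hypothetical illustration on plain scalars (the real algorithm recurses with matrix blocks in place of the scalars, giving O(n^2.807)):

```rust
// Strassen's 2x2 base identity: 7 multiplications instead of 8.
// Inputs are row-major [a11, a12, a21, a22]; purely illustrative.
fn strassen_2x2(a: [f64; 4], b: [f64; 4]) -> [f64; 4] {
    let (a11, a12, a21, a22) = (a[0], a[1], a[2], a[3]);
    let (b11, b12, b21, b22) = (b[0], b[1], b[2], b[3]);
    let m1 = (a11 + a22) * (b11 + b22);
    let m2 = (a21 + a22) * b11;
    let m3 = a11 * (b12 - b22);
    let m4 = a22 * (b21 - b11);
    let m5 = (a11 + a12) * b22;
    let m6 = (a21 - a11) * (b11 + b12);
    let m7 = (a12 - a22) * (b21 + b22);
    [
        m1 + m4 - m5 + m7, // c11
        m3 + m5,           // c12
        m2 + m4,           // c21
        m1 - m2 + m3 + m6, // c22
    ]
}
```

The extra additions mean the crossover versus a well-tuned cubic kernel is typically in the hundreds of rows, which matches the suggestion to test only at 1024 x 1024 and up.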
I've written a function that multiplies matrices by blocks. It should be faster on larger matrices due to the cache-related issues of the simple multiplication function. It doesn't change the memory layout of the matrix.
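Roughly, the approach looks like this (a simplified sketch, not the exact PR code; the function name, flat row-major layout, and block size are illustrative):

```rust
// Cache-friendly block (tiled) matrix multiplication for square n x n
// matrices stored row-major in flat slices. Each BLOCK x BLOCK tile of
// the operands is reused while it is still hot in cache.
const BLOCK: usize = 16;

fn multiply_block(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut c = vec![0.0; n * n];
    for ii in (0..n).step_by(BLOCK) {
        for kk in (0..n).step_by(BLOCK) {
            for jj in (0..n).step_by(BLOCK) {
                // Multiply the (ii,kk) tile of a by the (kk,jj) tile of b,
                // accumulating into the (ii,jj) tile of c.
                for i in ii..(ii + BLOCK).min(n) {
                    for k in kk..(kk + BLOCK).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + BLOCK).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}
```

Note the loop order inside a tile (i, k, j): hoisting `a[i][k]` and streaming along row k of b keeps the innermost loop sequential, without transposing anything.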
I've tested the performance of both the simple (previous) and block multiplication on square matrices. With a block size of 16 (which performed best on my computer), block multiplication outperforms simple multiplication from 32x32 upward, and at 256x256 it's twice as fast. I've copied the bench output at the end of this post.
PS: This is my first pull request, so I doubt it's well-formed. I wasn't sure if I had to raise an issue before making a pull request. Also, I'm not sure if the code is readable enough.
This is the bench output on my Intel i7-3740QM PC.
test matrix::transpose::matrix_transpose::bench::bench_gram_optimized ... bench: 3,174,322 ns/iter (+/- 169,171)
test matrix::transpose::matrix_transpose::bench::bench_gram_simple ... bench: 6,123,006 ns/iter (+/- 241,711)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0001_block ... bench: 273 ns/iter (+/- 78)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0001_simple ... bench: 32 ns/iter (+/- 2)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0002_block ... bench: 292 ns/iter (+/- 6)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0002_simple ... bench: 45 ns/iter (+/- 7)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0004_block ... bench: 378 ns/iter (+/- 11)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0004_simple ... bench: 99 ns/iter (+/- 14)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0008_block ... bench: 793 ns/iter (+/- 160)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0008_simple ... bench: 444 ns/iter (+/- 64)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0016_block ... bench: 3,378 ns/iter (+/- 274)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0016_simple ... bench: 2,602 ns/iter (+/- 174)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0032_block ... bench: 25,345 ns/iter (+/- 1,643)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0032_simple ... bench: 28,967 ns/iter (+/- 813)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0064_block ... bench: 192,730 ns/iter (+/- 23,921)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0064_simple ... bench: 242,750 ns/iter (+/- 4,406)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0128_block ... bench: 1,559,674 ns/iter (+/- 92,525)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0128_simple ... bench: 2,045,354 ns/iter (+/- 245,310)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0256_block ... bench: 12,406,499 ns/iter (+/- 498,650)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0256_simple ... bench: 27,055,370 ns/iter (+/- 753,259)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0512_block ... bench: 101,763,380 ns/iter (+/- 10,103,632)
test matrix::transpose::matrix_transpose::bench::bench_multiply_0512_simple ... bench: 227,811,582 ns/iter (+/- 21,193,979)