sparsemat / sprs

Sparse linear algebra library for Rust
Apache License 2.0

Quadratic complexity of sparse matrix multiplication #184

Closed · espdev closed this issue 4 years ago

espdev commented 4 years ago

Hello!

I'm trying to use the sprs package in my experimental code, and I have encountered very slow multiplication of large sparse matrices.

My code uses large diagonal matrices in CSR format. The matrices may contain from hundreds or thousands up to millions of NNZ elements. Multiplication of such matrices in sprs shows almost quadratic complexity, O(nnz^2), so we cannot currently use sprs in production code.
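For illustration, here is a minimal sketch of this kind of workload (not the benchmark code itself; it assumes sprs' `CsMat::new` CSR constructor, and the size is illustrative):

```rust
use sprs::CsMat;

fn main() {
    // An n x n diagonal matrix in CSR format: exactly one nnz per row.
    let n = 10_000;
    let indptr: Vec<usize> = (0..=n).collect();
    let indices: Vec<usize> = (0..n).collect();
    let data: Vec<f64> = vec![2.0; n];
    let a = CsMat::new((n, n), indptr, indices, data);

    // The product measured in the benchmarks: mat * mat.T.
    let at = a.transpose_view().to_csr();
    let c = &a * &at;
    println!("result nnz: {}", c.nnz());
}
```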

I wrote some tests in Rust+sprs and Python+scipy. SciPy uses the SMMP algorithm (Sparse Matrix Multiplication Package) and performs the multiplication of such matrices in tens to hundreds of milliseconds.

The code of my benchmarks can be found on gist: https://gist.github.com/espdev/d278962828ce7360d653b8679d7bb511

The plot for `mat * mat.T`: [plot: sprs_scipy_matmul]

Could you consider using the SMMP algorithm in sprs? This algorithm has complexity O(n_row*K^2 + max(n_row, n_col)), where K is the maximum nnz in a row of A and in a column of B.

rth commented 4 years ago

Thanks for investigating this! I have always wondered how sprs would compare to scipy.sparse performance-wise: https://github.com/scipy/scipy/issues/10201#issuecomment-582426961. It would also be interesting to have this comparison on random sparse arrays, to evaluate the impact of shape and density.

vbarrielle commented 4 years ago

That's a problem indeed, thanks for reporting it. I'll try and find some time to investigate.

espdev commented 4 years ago

I have compared sprs and scipy on random sparse matrices with various shapes and densities. The results are interesting: it seems I characterized the dependency of sprs's complexity on nnz incorrectly. The complexity of sprs's mat*mat also depends directly on the matrix dimensions: performance degrades dramatically as the matrix sizes increase, independently of nnz. In scipy, the dependency is quite different.

Here are some plots from my benchmarks.

scipy: [plot: scipy_sparse_matmul]

sprs: [plot: sprs_sparse_matmul]

scipy vs sprs: [plots: sprs_scipy_sparse_matmul_0, sprs_scipy_sparse_matmul_1]

The benchmark code can be found in this gist: https://gist.github.com/espdev/f16f139f8007d3b37d1a57835fee6a19

vbarrielle commented 4 years ago

Interesting. Here's the main loop of the matrix multiplication in sprs: https://github.com/vbarrielle/sprs/blob/master/src/sparse/prod.rs#L207-L226

It accumulates each row of the result matrix into a dense vector before converting it to sparse. That explains the dependency on the dimension (specifically, the number of columns). On the other hand, this approach becomes more interesting when the matrix is denser.
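To make the cost concrete, here is a simplified sketch of that dense-accumulator strategy (a hypothetical standalone function, not the actual code in prod.rs):

```rust
/// Simplified sketch of a dense-accumulator CSR product: for each result
/// row, scatter partial products into a dense workspace, then gather the
/// non-zeros back out.
fn csr_mul_dense_acc(
    a_indptr: &[usize], a_indices: &[usize], a_data: &[f64],
    b_indptr: &[usize], b_indices: &[usize], b_data: &[f64],
    n_rows: usize, n_cols: usize,
) -> (Vec<usize>, Vec<usize>, Vec<f64>) {
    let mut workspace = vec![0.0f64; n_cols];
    let mut indptr = vec![0usize];
    let mut indices = Vec::new();
    let mut data = Vec::new();
    for i in 0..n_rows {
        // Scatter: accumulate row i of A * B into the dense workspace.
        for k in a_indptr[i]..a_indptr[i + 1] {
            let (col_a, val_a) = (a_indices[k], a_data[k]);
            for l in b_indptr[col_a]..b_indptr[col_a + 1] {
                workspace[b_indices[l]] += val_a * b_data[l];
            }
        }
        // Gather: this scan is O(n_cols) per row regardless of the row's
        // nnz, which explains the dependency on the column dimension.
        for j in 0..n_cols {
            if workspace[j] != 0.0 {
                indices.push(j);
                data.push(workspace[j]);
                workspace[j] = 0.0;
            }
        }
        indptr.push(indices.len());
    }
    (indptr, indices, data)
}
```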

I've started a branch to implement SMMP. We'll have more information when I'm done implementing it. But your graphs suggest it might be worth keeping the original product implementation around, perhaps with a modification to sparsify the structure that accumulates the result row.

vbarrielle commented 4 years ago

Progress report: I've finished implementing the symbolic and numeric parts of SMMP in the smmp branch. I won't have time to check the performance this weekend, but I should be able to look at it next week.
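For readers unfamiliar with SMMP: it works in two passes, a symbolic pass that computes the sparsity pattern of the result and a numeric pass that fills in the values. Here is a sketch of the linked-list trick at the heart of the symbolic pass (the idea only, not the code on the smmp branch):

```rust
/// Count the nnz of each output row of A * B (both CSR) with a linked list
/// threaded through the column indices, so each row costs O(K^2) rather
/// than O(n_cols).
fn symbolic_row_nnz(
    a_indptr: &[usize], a_indices: &[usize],
    b_indptr: &[usize], b_indices: &[usize],
    n_rows: usize, n_cols: usize,
) -> Vec<usize> {
    const UNSET: usize = usize::MAX;   // column not yet seen in this row
    const END: usize = usize::MAX - 1; // end-of-list marker
    let mut next = vec![UNSET; n_cols];
    let mut row_nnz = vec![0usize; n_rows];
    for i in 0..n_rows {
        let mut head = END;
        let mut count = 0;
        for k in a_indptr[i]..a_indptr[i + 1] {
            let col_a = a_indices[k];
            for l in b_indptr[col_a]..b_indptr[col_a + 1] {
                let j = b_indices[l];
                if next[j] == UNSET {
                    // First time column j is hit: push it onto the list.
                    next[j] = head;
                    head = j;
                    count += 1;
                }
            }
        }
        row_nnz[i] = count;
        // Walk the list to reset the workspace; note that the recovered
        // columns come out in insertion order, not sorted order.
        while head != END {
            let next_head = next[head];
            next[head] = UNSET;
            head = next_head;
        }
    }
    row_nnz
}
```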

I also have a request for you, @espdev: would you be OK with contributing your benchmarks to the examples folder of sprs? I'd like to have them around to be able to monitor performance, and I want to be sure you're fine with contributing them under sprs' license.

espdev commented 4 years ago

@vbarrielle

Wow, that's great, thank you! I will clean up and refactor my benchmarks to contribute them to the sprs examples soon.

vbarrielle commented 4 years ago

Here are the performance results for smmp on your first benchmark:

[plot: sprs_vs_scipy]

(EDIT: these results come with a caveat: I generated them on my unplugged laptop.)

Looks good. Now I have to take a look at your second benchmark. It will probably show a complexity difference between scipy's implementation of SMMP and mine: I have to sort the indices of the matrices, because sorted indices are a guarantee of the CsMat class. That makes the complexity O((n_row*K^2 + max(n_row, n_col)) * K*log(K)). That should not matter for most workloads, though.
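Concretely, the extra work is a per-row co-sort of column indices and values after the numeric pass; a hypothetical helper illustrating the O(k log k) cost (not the code from the smmp branch):

```rust
/// Restore the sorted-indices invariant that CsMat guarantees, after a
/// numeric pass that emits each row's entries unsorted. Sorting a row
/// with k non-zeros costs O(k log k).
fn sort_row(indices: &mut [usize], data: &mut [f64]) {
    let mut order: Vec<usize> = (0..indices.len()).collect();
    order.sort_unstable_by_key(|&i| indices[i]);
    let sorted_indices: Vec<usize> = order.iter().map(|&i| indices[i]).collect();
    let sorted_data: Vec<f64> = order.iter().map(|&i| data[i]).collect();
    indices.copy_from_slice(&sorted_indices);
    data.copy_from_slice(&sorted_data);
}
```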

vbarrielle commented 4 years ago

Here are the results for the random matrices:

[plot: sprs_vs_scipy_random_mats]

We can see that sprs is lagging behind now. I've profiled it, and most of the time is spent sorting the indices. If I disable index sorting, sprs is more performant:

[plot: sprs_unsorted_vs_scipy_random_mats]

I'm going to investigate alternative index-sorting methods. I could try using an array of booleans when the nnz count is too large.

espdev commented 4 years ago

@vbarrielle

I noticed that the matrix sizes differ in the legend text on the graphs:

sprs (15000, 2500)
scipy (15000, 25000)

It looks like it is just a typo.

vbarrielle commented 4 years ago

Indeed, I meant to type (15000, 25000) for both.

vbarrielle commented 4 years ago

Small update: I've figured out that I can dynamically switch between using the "linked list" indices from SMMP followed by a sort, and using a linear scan of the encountered nnz indices, according to the expected amount of work.

This makes the performance much better when there is a large number of non-zeros:

[plot: sprs_smart_vs_scipy_random_mats]

(EDIT: the legend is wrong once again; I forgot to remove the "(unsorted)" label. It is sorted.)

My test for switching between the two methods is a bit naive right now, but it seems to work well enough. In short, for each row I compare the number of elements I need to scan (the difference between the maximum and minimum encountered nnz column) with the amount of work sorting would take (k * log(k), with k the number of nnz in the row). I should probably do some timings to determine a more precise bound. There are also a few optimizations that could be done for the linear scan, as it could take advantage of SIMD.
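As a sketch, the heuristic amounts to something like the following (hypothetical helper; the real cut-off may differ):

```rust
/// Per-row heuristic: prefer a linear scan over the encountered column
/// span when it is cheaper than an estimated k * log2(k) sorting cost.
fn prefer_linear_scan(min_col: usize, max_col: usize, row_nnz: usize) -> bool {
    let scan_work = (max_col - min_col + 1) as f64;
    let k = row_nnz.max(2) as f64;
    let sort_work = k * k.log2();
    scan_work < sort_work
}
```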

vbarrielle commented 4 years ago

Quick mention that sparse product performance has been further improved in #201.