Parallel smmp

This pull request improves the performance of the sparse matrix product by taking advantage of multiple cores, thanks to rayon. Here are benchmark results demonstrating the improvements (benchmarked on an AMD Ryzen 7 3700X 8-Core Processor this time):

[Figure: sparse_mult_perf_by_shape_no_old]

Here's a description of the entries:

- The red line corresponds to the SMMP algorithm as implemented in #187.
- The pink line (boolvec 1T) is a micro-optimization of the SMMP algorithm: the symbolic part uses a Vec<bool> to check whether a column has already been seen, which is more cache friendly than the index-based "linked list" of the original SMMP algorithm.
- boolvec 2T and 4T are the same as boolvec 1T, but with the workloads of the symbolic and numeric parts split across 2 and 4 threads, respectively.
- boolvec auto chooses the number of threads dynamically, which avoids the threading overhead for small nonzero counts while still allowing all CPU cores to be used when needed. This is the most efficient strategy and thus the default, though it can be configured. The thresholds used to pick the number of threads may still need some tuning.
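
To illustrate the boolvec idea, here is a minimal sketch of a symbolic pass that marks seen columns in a `Vec<bool>`. It is not the code from this pull request: the function name `symbolic_boolvec` and the raw-slice CSR arguments are assumptions made for the example, and the inputs are assumed to be valid CSR structures (non-empty `indptr`, in-range indices).

```rust
/// Symbolic pass of a CSR x CSR product C = A * B: computes the sparsity
/// pattern (indptr, indices) of C without touching any values.
/// Hypothetical helper for illustration, not part of the sprs API.
fn symbolic_boolvec(
    a_indptr: &[usize],
    a_indices: &[usize],
    b_indptr: &[usize],
    b_indices: &[usize],
    b_cols: usize,
) -> (Vec<usize>, Vec<usize>) {
    let a_rows = a_indptr.len() - 1;
    // One flag per column of B: true if the column already appears in the
    // row of C currently being built.
    let mut seen = vec![false; b_cols];
    let mut c_indptr = Vec::with_capacity(a_rows + 1);
    let mut c_indices = Vec::new();
    c_indptr.push(0);
    for row in 0..a_rows {
        let row_start = c_indices.len();
        for &k in &a_indices[a_indptr[row]..a_indptr[row + 1]] {
            for &col in &b_indices[b_indptr[k]..b_indptr[k + 1]] {
                if !seen[col] {
                    seen[col] = true;
                    c_indices.push(col);
                }
            }
        }
        // Reset only the flags touched by this row, so the cost stays
        // proportional to the row's nonzero count.
        for &col in &c_indices[row_start..] {
            seen[col] = false;
        }
        // Sort the column indices of this row.
        c_indices[row_start..].sort_unstable();
        c_indptr.push(c_indices.len());
    }
    (c_indptr, c_indices)
}
```

For a 2 x 3 matrix A with pattern `[[x, 0, x], [0, x, 0]]` and a 3 x 3 matrix B with pattern `[[x, 0, x], [0, x, 0], [x, x, 0]]`, the first row of the result is full and the second contains only column 1.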

As before, with no multithreading, it takes a very large matrix to beat scipy's implementation (mostly because we need to sort the indices whereas scipy does not guarantee sorted indices). However, as far as I know scipy's matrix product is not parallelized, so taking advantage of multiple cores gives sprs an advantage.

On smaller shapes it takes a lot of threads to beat scipy (here the shape is 15000 x 25000):

[Figure: sparse_mult_perf_150000_25000]

I'm not entirely satisfied with the current multithreading implementation: dividing the work into independent chunks has been a bit clunky, and I'm sure it could be done in a cleaner way. That can wait, though; the current state is correct and is a nice milestone.
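
As a rough illustration of what dividing the work into independent chunks can look like, the sketch below splits the rows of the left operand into contiguous ranges, runs the symbolic pass on each range in its own thread, and stitches the per-chunk results together. It is a sketch under assumptions, not the implementation from this pull request: it uses `std::thread::scope` (Rust 1.63 or later) instead of rayon to stay dependency-free, and the names `symbolic_chunk` and `symbolic_parallel` are hypothetical.

```rust
use std::ops::Range;
use std::thread;

/// Pattern of the given rows of C = A * B, with a chunk-local indptr
/// starting at 0. Hypothetical helper, not part of the sprs API.
fn symbolic_chunk(
    rows: Range<usize>,
    a_indptr: &[usize],
    a_indices: &[usize],
    b_indptr: &[usize],
    b_indices: &[usize],
    b_cols: usize,
) -> (Vec<usize>, Vec<usize>) {
    let mut seen = vec![false; b_cols]; // each chunk owns its marker buffer
    let mut indptr = vec![0];
    let mut indices = Vec::new();
    for row in rows {
        let start = indices.len();
        for &k in &a_indices[a_indptr[row]..a_indptr[row + 1]] {
            for &col in &b_indices[b_indptr[k]..b_indptr[k + 1]] {
                if !seen[col] {
                    seen[col] = true;
                    indices.push(col);
                }
            }
        }
        for &col in &indices[start..] {
            seen[col] = false;
        }
        indices[start..].sort_unstable();
        indptr.push(indices.len());
    }
    (indptr, indices)
}

/// Runs the symbolic pass on equally sized row chunks, one thread each.
fn symbolic_parallel(
    a_indptr: &[usize],
    a_indices: &[usize],
    b_indptr: &[usize],
    b_indices: &[usize],
    b_cols: usize,
    n_threads: usize,
) -> (Vec<usize>, Vec<usize>) {
    let a_rows = a_indptr.len() - 1;
    let n_chunks = n_threads.max(1);
    // Ceiling division, clamped to 1 so that step_by never receives 0.
    let chunk_len = ((a_rows + n_chunks - 1) / n_chunks).max(1);
    let parts: Vec<(Vec<usize>, Vec<usize>)> = thread::scope(|s| {
        // Spawn every chunk first, then join in row order.
        let handles: Vec<_> = (0..a_rows)
            .step_by(chunk_len)
            .map(|start| {
                let end = (start + chunk_len).min(a_rows);
                s.spawn(move || {
                    symbolic_chunk(start..end, a_indptr, a_indices, b_indptr, b_indices, b_cols)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Stitch the chunks together, shifting each local indptr by the
    // number of indices already emitted.
    let mut c_indptr = vec![0];
    let mut c_indices = Vec::new();
    for (indptr, indices) in parts {
        let offset = c_indices.len();
        c_indptr.extend(indptr[1..].iter().map(|&p| p + offset));
        c_indices.extend(indices);
    }
    (c_indptr, c_indices)
}
```

Because each chunk only reads the shared inputs and writes to its own buffers, no synchronization is needed beyond the final join. The price is one marker buffer per thread and a sequential stitching step at the end.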

It's quite possible that some performance can still be gained in the future: the heuristic that divides the workload for the symbolic part has not been extensively tested, so some threads could be starved of work. A simple improvement could be to delay the sorting of indices until the actual number of nonzeros of the result is known, so that the sorting workload can be divided evenly.
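
One possible way to reduce starvation, sketched here purely as an assumption and not as the heuristic used in this pull request, is to estimate the work of each row up front and cut the row ranges so that each chunk carries a similar share. The estimate below is an upper bound on the nonzero count of each result row (the summed lengths of the rows of the right operand it references); the names `row_work_estimates` and `balanced_row_ranges` are hypothetical.

```rust
use std::ops::Range;

/// Upper bound on the work for each row of A in C = A * B: the summed
/// lengths of the rows of B referenced by that row.
fn row_work_estimates(
    a_indptr: &[usize],
    a_indices: &[usize],
    b_indptr: &[usize],
) -> Vec<usize> {
    a_indptr
        .windows(2)
        .map(|w| {
            a_indices[w[0]..w[1]]
                .iter()
                .map(|&k| b_indptr[k + 1] - b_indptr[k])
                .sum::<usize>()
        })
        .collect()
}

/// Splits the rows into at most `n_chunks` contiguous ranges whose
/// estimated work is roughly equal.
fn balanced_row_ranges(work: &[usize], n_chunks: usize) -> Vec<Range<usize>> {
    let n_chunks = n_chunks.max(1);
    let total: usize = work.iter().sum();
    let mut ranges = Vec::new();
    let mut start = 0;
    let mut acc = 0;
    for (row, &w) in work.iter().enumerate() {
        acc += w;
        // Close the current chunk once the running total reaches its
        // proportional share; the last chunk takes whatever remains.
        let target = total * (ranges.len() + 1) / n_chunks;
        if acc >= target && ranges.len() + 1 < n_chunks {
            ranges.push(start..row + 1);
            start = row + 1;
        }
    }
    if start < work.len() {
        ranges.push(start..work.len());
    }
    ranges
}
```

With estimates `[1, 1, 1, 1, 4]` and two chunks, the split is rows `0..4` and `4..5`, so the single heavy row gets a thread of its own instead of being grouped with lighter rows by a plain equal-length split.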

sparsemat / sprs

Parallel smmp #201