sparsemat / sprs

sparse linear algebra library for rust
Apache License 2.0

Performance Issues with `sprs-ldl`. #199

Open PTNobel opened 4 years ago

PTNobel commented 4 years ago

I've been using sprs-ldl to solve some symmetric sparse matrix systems and found the performance to be surprisingly poor. lsolve_csc_dense_rhs has significantly better performance.

Here is the benchmark I've been using: https://github.com/PTNobel/sprs-performance-test

which produced the following output:

```
LdlNumeric took 5775ms to solve 3 sparse systems
lsolve_csc_dense_rhs took 11ms to solve 3 sparse systems
```

I'm happy to answer more questions if needed, but that'll probably have to come on the weekend.

This issue was filed per https://www.reddit.com/r/rust/comments/gzazna/whats_the_current_state_of_rusts_sparse_linear/fv9hf0e/?context=3

rth commented 4 years ago

Thanks for the report @PTNobel ! Similarly to https://github.com/vbarrielle/sprs/issues/188, it would probably be nice to adapt your code and add it to the benchmarks as a starting point. I'm not familiar with the implementation, so I can't comment on the reasons behind this difference in performance. Someone would probably need to profile it to see what is happening.

vbarrielle commented 4 years ago

Hello @PTNobel and thanks for the report. I'm afraid the performance of sprs-ldl is not that good right now, for several reasons.

Something troubles me about your benchmark though: it is expected that lsolve_csc_dense_rhs should be 2 to 3 times faster than a solve with LDL on an already factorized system, because LDL actually does two triangular solves and a diagonal solve. However, lsolve_csc_dense_rhs requires its input to be lower triangular, and does not check that it is. So it's quite possible the solve results in your benchmark are quite different.
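
To make that concrete, here is a minimal sketch contrasting the two calls on a hypothetical 3x3 system. It assumes the signatures of sprs's `lsolve_csc_dense_rhs` and sprs-ldl's `LdlNumeric` as found in the 0.x releases; exact paths and return types may differ across versions:

```rust
use sprs::linalg::trisolve::lsolve_csc_dense_rhs;
use sprs::{CsMat, TriMat};
use sprs_ldl::LdlNumeric;

fn main() {
    // A small symmetric positive definite matrix, stored as a full
    // symmetric CSC matrix (both triangles present).
    let mut tri = TriMat::new((3, 3));
    tri.add_triplet(0, 0, 4.0);
    tri.add_triplet(1, 1, 4.0);
    tri.add_triplet(2, 2, 4.0);
    tri.add_triplet(0, 1, 1.0);
    tri.add_triplet(1, 0, 1.0);
    let mat: CsMat<f64> = tri.to_csc();

    let b = vec![1.0, 2.0, 3.0];

    // Full LDL^T solve: a lower triangular solve, a diagonal solve,
    // then an upper triangular solve -- roughly 2-3x the work of a
    // single triangular solve.
    let ldl = LdlNumeric::new(mat.view()).unwrap();
    let x_ldl = ldl.solve(&b);
    println!("x_ldl = {:?}", x_ldl);

    // Single triangular solve: only meaningful if `mat` is lower
    // triangular, which it is not here. Triangularity is not verified,
    // so even if the call succeeds, `x_tri` need not solve mat * x = b.
    let mut x_tri = b.clone();
    if lsolve_csc_dense_rhs(mat.view(), &mut x_tri[..]).is_ok() {
        println!("x_tri = {:?}", x_tri);
    }
}
```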

Another thing about the benchmark: it includes the cost of converting from triplet format to CSC, but it would probably be better to avoid counting this as solve time.
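
A hedged sketch of that restructuring, using std::time::Instant (the TriMat/CsMat names are from sprs; the helper itself is hypothetical):

```rust
use std::time::Instant;

use sprs::{CsMat, TriMat};
use sprs_ldl::LdlNumeric;

// Hypothetical helper: time the triplet -> CSC conversion separately
// from the factorization and solve.
fn time_solve(tri: &TriMat<f64>, b: &[f64]) -> Vec<f64> {
    let t_convert = Instant::now();
    let mat: CsMat<f64> = tri.to_csc();
    println!("triplet -> csc: {:?}", t_convert.elapsed());

    let t_solve = Instant::now();
    let ldl = LdlNumeric::new(mat.view()).unwrap();
    let x = ldl.solve(&b);
    println!("factor + solve: {:?}", t_solve.elapsed());
    x
}
```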

I should investigate this once I'm done improving the matrix product performance (I'm currently investigating parallelization opportunities). Please note this can take some time; I don't have much free time at the moment. Any help is welcome of course :)

Thanks for the report!

PTNobel commented 4 years ago

@vbarrielle, I think the lowest-hanging fruit here is to:

1) Move the CSC transforms out of the timed section.
2) Verify that the two solutions have the same results.
3) Add sprs_suitesparse_ldl as a comparison point.
4) Apply Reverse Cuthill-McKee as another test.
5) Incorporate the tests into the sprs benchmark framework and merge them into here.
6) Figure out how to turn off the symmetry test.

vbarrielle commented 4 years ago

Hello, just a note that I can start investigating now that #201 is done.

PTNobel commented 3 years ago

Anything you want me to do? I'm happy to try and work through my list of items from my last comment.

vbarrielle commented 3 years ago

@PTNobel I've checked item 1, and the CSC transform does not take much time, so it does not matter if it stays in the timed section.

I'm going to focus on items 4 and 6, which mostly require that I publish a new version of sprs with Cuthill-McKee exposed; once that's done I can enhance the API of sprs-ldl to let the caller ask for a fill-in reduction method, and disable the symmetry check.
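
For reference, a sketch of what that API could look like. The `Ldl` builder and the `FillInReduction`/`SymmetryCheck` names match what later sprs/sprs-ldl releases expose; treat the exact paths as an assumption:

```rust
use sprs::{CsMat, FillInReduction, SymmetryCheck};
use sprs_ldl::Ldl;

// `mat` is a symmetric CsMat<f64> built elsewhere.
fn factor_and_solve(mat: &CsMat<f64>, b: &[f64]) -> Vec<f64> {
    let ldl = Ldl::new()
        // Item 6: let the caller skip the symmetry verification.
        .check_symmetry(SymmetryCheck::DontCheckSymmetry)
        // Item 4: ask for a fill-in reducing permutation.
        .fill_in_reduction(FillInReduction::ReverseCuthillMcKee)
        .numeric(mat.view())
        .unwrap();
    ldl.solve(&b)
}
```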

So if you want, you can look into items 2, 3 or 5. Thanks for offering your help, by the way.

vbarrielle commented 3 years ago

I wanted to have comparison points with other factorization algorithms, so I serialized the first jacobian in your example and loaded it in scipy.

There I've been able to test LU factorization (scipy uses the SuperLU library under the hood):

```python
In [23]: %timeit scipy.sparse.linalg.splu(jaco)
1 loop, best of 5: 1.9 s per loop
```

I've also timed the CHOLMOD algorithm, which is probably the best library for Cholesky decomposition. I had to add a diagonal offset to avoid an error signaling that the matrix is not positive definite:

```python
In [20]: %timeit sksparse.cholmod.cholesky(jaco, 100.)
1 loop, best of 5: 462 ms per loop
```

For the record, on my machine, using Reverse Cuthill-McKee, I get the following timing on your benchmark:

```
LdlNumeric took 4379ms to solve 3 sparse systems
```

which would mean about 1.46s per solve, but this is not an accurate comparison: the symbolic part is re-used across the systems, while the backsolve is included in the timing.
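
That amortized pattern looks roughly like the following hedged sketch for a single matrix with several right-hand sides (assuming the `LdlNumeric` API mentioned earlier; the benchmark itself re-uses the symbolic analysis across its three systems):

```rust
use sprs::CsMat;
use sprs_ldl::LdlNumeric;

// One factorization, several right-hand sides: the expensive numeric
// factorization is done once, only the cheap backsolves are repeated.
fn solve_many(mat: &CsMat<f64>, rhss: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let ldl = LdlNumeric::new(mat.view()).unwrap();
    rhss.iter().map(|b| ldl.solve(b)).collect()
}
```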

In general, the number that governs the performance of the factorization is the fill-in, i.e. the number of additional nonzeros created by the factorization. The jacobian has 41818 nonzeros. Factoring with sprs-ldl gives 1654098 nonzeros, which can be reduced to 1498310 nonzeros with Reverse Cuthill-McKee. SuperLU has 1354237 nonzeros in its L factor and 1354558 in its U factor. CHOLMOD has 1168635 nonzeros.

So part of the better performance of CHOLMOD is probably explained by its use of a better fill-in reducing permutation (it probably uses AMD). It's also an extremely well-tuned library.

However, I find the fill-in quite large, even for CHOLMOD. But I notice the graph is wired randomly, and I'm pretty sure fill-in reducing permutations work by discovering structure in the nonzero graph, so this is quite a pathological example. Maybe we can expect better numbers if the graph is, for instance, a regular grid or a surface mesh. That's something to try.

PTNobel commented 3 years ago

> But I notice the graph is wired randomly

I'll add that this is because my use case genuinely encounters random graphs quite frequently, sadly; the more structure is present, the less useful the work we're doing is.

PTNobel commented 3 years ago

So as you suspected, the results being returned are wildly different, suggesting that lsolve_csc_dense_rhs was fast because it is wrong... I'll push this code to my repo tonight, resolving item 2.
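
One way to check this is to compare residuals rather than the solution vectors themselves. A hypothetical helper along those lines (assuming sprs's CsMat::iter yields (value, (row, col)) over the stored nonzeros):

```rust
use sprs::CsMat;

// `x` solves `mat * x = b` iff this norm is tiny.
fn residual_norm(mat: &CsMat<f64>, x: &[f64], b: &[f64]) -> f64 {
    let mut r = b.to_vec();
    // r = b - mat * x, accumulated over the stored nonzeros.
    for (val, (row, col)) in mat.iter() {
        r[row] -= val * x[col];
    }
    r.iter().map(|v| v * v).sum::<f64>().sqrt()
}
```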

PTNobel commented 3 years ago

Question:

What is the difference between LdlLongNumeric and LdlNumeric in sprs_suitesparse_ldl?

vbarrielle commented 3 years ago

The difference is the integer type the underlying C library expects for the storage of the indices and indptr of the system matrix: the former expects a libc::c_long while the latter expects a libc::c_int. This mostly matters if your number of rows/columns is bigger than what fits in a c_int; here it shouldn't matter. You probably need to call CsMat::to_other_types to convert to the correct C types.
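
For instance, a sketch of that conversion (assuming to_other_types converts both the scalar and the index type, and that c_int/c_long are i32/i64 on a typical 64-bit Unix target):

```rust
use sprs::{CsMat, CsMatI};

// `mat` uses Rust-native usize indices; the C bindings want C integers.
fn to_c_index_types(mat: &CsMat<f64>) -> (CsMatI<f64, i32>, CsMatI<f64, i64>) {
    // For LdlNumeric (libc::c_int, i.e. i32 on most platforms):
    let mat_int: CsMatI<f64, i32> = mat.to_other_types();
    // For LdlLongNumeric (libc::c_long, i.e. i64 on 64-bit Unix):
    let mat_long: CsMatI<f64, i64> = mat.to_other_types();
    (mat_int, mat_long)
}
```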

vbarrielle commented 3 years ago

The latest micro-optimizations in sprs-ldl (see #207) have put its performance on par with the original C implementation of LDL (see https://github.com/PTNobel/sprs-performance-test/pull/2).

I guess the way to go now would be to have a better fill-in reducing permutation, which will require binding to those available in suitesparse, or implementing one of them.

I'll probably look into the former in the near future.

vbarrielle commented 3 years ago

The binding of SuiteSparse's CAMD has enabled some more performance gains, and, as mentioned in https://github.com/PTNobel/sprs-performance-test/pull/2#issuecomment-706390315, I don't think there are many more improvements to be hoped for using LDL. A good way to improve performance would be a supernodal Cholesky decomposition (like CHOLMOD), or a multifrontal one to exploit parallelism. I think the latter is a bit more accessible than the former, so I'll probably have a look into it at some point in the future, but it's probably going to take a while, as my free time is scarce these days.
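
A hedged sketch of selecting that ordering (this assumes the FillInReduction::CAMDSuiteSparse variant and the corresponding SuiteSparse feature flag in sprs-ldl; names may differ across releases):

```rust
use sprs::{CsMat, FillInReduction};
use sprs_ldl::Ldl;

// Assumes sprs-ldl was built with its SuiteSparse CAMD feature enabled.
fn factor_with_camd(mat: &CsMat<f64>, b: &[f64]) -> Vec<f64> {
    Ldl::new()
        .fill_in_reduction(FillInReduction::CAMDSuiteSparse)
        .numeric(mat.view())
        .unwrap()
        .solve(&b)
}
```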

rwl commented 2 years ago

An implementation of Approximate Minimum Degree and/or Nested Dissection would be good...

I published a pure Rust translation of AMD:

https://crates.io/crates/amd

It works well with this solver:

https://crates.io/crates/rlu

mulimoen commented 2 years ago

@rwl That seems like it could fit very well with this crate!

rwl commented 2 years ago

I also translated QDLDL from the OSQP solver project:

https://crates.io/crates/ldl

It is published under the Apache 2.0 license and could also be a good fit.