rgl-epfl / cholespy

An easily integrable Cholesky solver on CPU and GPU
BSD 3-Clause "New" or "Revised" License

add optional argument for specifying factorization strategy #31

Open samuelpmish opened 5 months ago

samuelpmish commented 5 months ago

Hi,

After reading about the significant performance difference between simplicial and supernodal factorization (https://github.com/rgl-epfl/cholespy/issues/25), I thought it would be beneficial to expose that choice to the user. This PR adds a new argument to the CholeskySolver class for specifying which factorization strategy to use. CHOLMOD also provides an "AUTO" option that uses a heuristic to decide which strategy is appropriate for the provided matrix, so I use that as the default.

Since the new argument has a default value, this change shouldn't break existing users' code, although defaulting to "AUTO" potentially changes which strategy CHOLMOD selects, which could change the runtime of existing code.
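For illustration, the user-facing option just needs to map onto the constants CHOLMOD reads from `cholmod_common.supernodal` (`CHOLMOD_SIMPLICIAL = 0`, `CHOLMOD_AUTO = 1`, `CHOLMOD_SUPERNODAL = 2` in `cholmod_core.h`). A rough sketch of that mapping; the name `FactorMode` here is hypothetical, not necessarily the spelling used in the PR:

```python
from enum import IntEnum

class FactorMode(IntEnum):
    # Values mirror CHOLMOD's constants in cholmod_core.h.
    SIMPLICIAL = 0   # CHOLMOD_SIMPLICIAL: always use a simplicial factorization
    AUTO = 1         # CHOLMOD_AUTO: heuristic choice (used as the default here)
    SUPERNODAL = 2   # CHOLMOD_SUPERNODAL: always use a supernodal factorization

# Before calling cholmod_factorize, the solver would copy the user's
# choice into the CHOLMOD workspace, e.g. cc.supernodal = int(mode).
mode = FactorMode.AUTO
print(int(mode))  # → 1
```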

The performance of two test problems (as measured on an M2 MacBook) is summarized below:

  1. get_icosphere(7) (2D connectivity): supernodal was a 1.7x speedup over simplicial
  2. a tetrahedron mesh of a sphere (3D connectivity): supernodal was a 4.1x speedup over simplicial

In both cases, the default "CHOLMOD_AUTO" option picked the faster (supernodal) approach.

The small performance test script (and relevant data for the tetrahedron mesh) is available here.
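For reference, the measurement amounts to a best-of-N wall-clock loop along these lines (a sketch only; the workload shown is a stand-in for the actual solver construction, which is what the real script times under each strategy):

```python
import time

def bench(fn, *args, repeats=5):
    """Return the best-of-N wall-clock time for fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Stand-in workload; the real comparison times CholeskySolver
# construction with simplicial vs. supernodal factorization.
def workload(n):
    sum(i * i for i in range(n))

t_simplicial = bench(workload, 200_000)
t_supernodal = bench(workload, 50_000)
print(f"speedup: {t_simplicial / t_supernodal:.1f}x")
```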

samuelpmish commented 5 months ago

Oh, also: when I ran pytest test_cholesky.py, only a small fraction of the tests were not skipped. The tests that did run passed, but I can't verify that the remaining (skipped) tests work correctly.

wjakob commented 5 months ago

A first question: enabling a supernodal decomposition on the CPU makes sense, but what does this mean for the CUDA implementation? I suppose that the kernels would need to be updated to deal with this case. Does it even make sense to use such a strategy on the GPU?

samuelpmish commented 5 months ago

> Enabling a supernodal decomposition on the CPU makes sense, but what does this mean for the CUDA implementation? ... Does it even make sense to use such a strategy on the GPU?

From the original talk about GPU-accelerated CHOLMOD, it sounds like supernodal was the only supported factorization strategy. I'm not sure whether that's still true today. If so, I wonder what CHOLMOD does if you tell it to use the GPU and also specify the simplicial strategy.

> I suppose that the kernels would need to be updated to deal with this [supernodal option].

Which kernels need updating?

samuelpmish commented 5 months ago

When I run pytest test_cholesky.py on a machine with an NVIDIA GPU (where 17/20 of the tests are not skipped, and they run successfully), nvprof reveals that none of the linear algebra in cholespy uses the GPU, except for the triangular solves.

CHOLMOD's recent user guide says that only cholmod_l_factorize (the 64-bit integer variant) supports GPU acceleration, but cholespy uses the 32-bit integer variant, which doesn't seem to take advantage of GPU hardware:

https://github.com/rgl-epfl/cholespy/blob/acb3342c31bbb9c15955494b733eca9d64d3aac0/src/cholesky_solver.cpp#L236-L238

Am I goofing something up?